Abstract

In learning problems available information is usually divided into two categories: examples of function 
values (or training data) and prior information (e.g. a smoothness constraint). 

This paper 1.) studies aspects in which these two categories usually differ, like their relevance for
generalization and their role in the loss function, 2.) presents a unifying formalism, where both types of
information are identified with answers to generalized questions, 3.) shows what kind of generalized
information is necessary to enable learning, 4.) aims to put usual training data and prior information on
a more equal footing by discussing possibilities and variants of measurement and control for generalized
questions, including the examples of smoothness and symmetries, 5.) briefly reviews the measurement of
linguistic concepts based on fuzzy priors, and principles to combine preprocessors, 6.) uses a Bayesian
decision theoretic framework, contrasting parallel and inverse decision problems, 7.) proposes, for problems
with non-approximation aspects, a Bayesian two step approximation consisting of posterior maximization
and a subsequent risk minimization, 8.) analyses empirical risk minimization under the aspect of
nonlocal information, 9.) compares the Bayesian two step approximation with empirical risk minimization,
including their interpretations of Occam's razor, 10.) formulates examples of stationarity conditions for
the maximum posterior approximation with nonlocal and nonconvex priors, leading to inhomogeneous
nonlinear equations, similar for example to equations in scattering theory in physics.

In summary, this paper focuses on the dependencies between answers to different questions. Because
it is not training examples alone but such dependencies that enable generalization, it emphasizes the
need for their empirical measurement and control and for a more explicit treatment in theory.
1 Introduction

To clarify the aim of the paper, our use of the term "prior 
information" and to define vocabulary and notation we 
analyze the fundamental problem of generalization for a 
simple toy example: 

Let us assume that we are interested in the answers $y_1(x_1)$ and
$y_2(x_2)$ to two questions $x_1$ and $x_2$. We will call these
questions relevant to our application. Let us assume we have found
for $y_1$ the value $y_1 = 2$ in a first measurement. We are
interested in the future output when measuring $x_1$ and $x_2$.
Clearly this is not possible without further assumptions. These
assumptions are used in choosing a set $F^0$ of 'possible states of
nature' $f^0$. If we assume that the problem is deterministic, then
the problem is already solved for $x_1$. We call this assumption
local as it refers to a single question. If the problem is
probabilistic and we want to determine the probability of future
answers $y_1$ to $x_1$, this requires the assumption of a set of
stationary distributions $p(y_1|x_1, f^0)$ corresponding to possible
states $f^0$ for $y_1$, so we can use the data to reweight the
probabilities $p(f^0)$. The collection of $p(f^0)$ is what we call a
'state of knowledge' and can be used to predict future outcomes $y_1$.
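The reweighting of the probabilities $p(f^0)$ by local data can be sketched in a few lines of Python. This is only a minimal illustration under assumed numbers: the two candidate states `A` and `B`, their answer distributions and the uniform prior are hypothetical and not taken from the paper.

```python
def reweight(prior, likelihoods, y):
    """Return p(f0 | y) for every candidate state f0 (Bayes' rule)."""
    unnormalized = {f0: prior[f0] * likelihoods[f0].get(y, 0.0) for f0 in prior}
    norm = sum(unnormalized.values())
    if norm == 0.0:
        raise ValueError("observed answer is impossible under every state")
    return {f0: w / norm for f0, w in unnormalized.items()}


# Hypothetical stationary answer distributions p(y1 | x1, f0) for two states.
likelihoods = {
    "A": {2: 0.8, 3: 0.2},
    "B": {2: 0.3, 3: 0.7},
}
prior = {"A": 0.5, "B": 0.5}          # initial state of knowledge p(f0)

# First measurement of x1 gave y1 = 2, as in the toy example.
posterior = reweight(prior, likelihoods, y=2)

# Predictive probability that the next answer to x1 is again 2.
predictive_2 = sum(posterior[f0] * likelihoods[f0][2] for f0 in posterior)
```

The state of knowledge after the measurement favours state `A`, and the same collection of probabilities is then used to predict future answers to $x_1$. Nothing in this computation says anything about $x_2$.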

But in most real world problems there are new questions $x_2$ for
which no training examples are available. This is the generalization
problem. It seems even harder than the learning problem for $x_1$,
for which at least local data are available. To remedy the situation
we clearly need answers which depend on $y_2$ at least indirectly. We
may directly use a nonlocal assumption like $y_2 = y + y_1$. This is
trivial if we know $y$. But this may not be the case in the beginning
and this paper analyses how information about $y$ may be obtained.
Therefore, we want to start with independent relevant outcomes $y_1$
and $y_2$ (what we will call a factorial state) and try to relate all
dependencies to additional information. That means we see $y$ as an
answer to a question $q$ with $y(q) = y_2 - y_1$. This nonlocal
question $q$ represents a measurement device for differences. We call
questions depending on more than one $y_i$ generalized questions.
Examples are nonlocal questions like $y_2 - y_1$ or questions
referring to repeated measurements of the same question like
$y_1 + y_1'$. We can now separate the nonlocal information in two
parts: the structural information, that is the definition of the
nonlocal question $q$, and an empirical or controllable part, that is
the result of measuring $q$ or controlling its answer. The definition
of $q$ alone does not yet determine the value or probability of $y_2$.
Only an actual measurement of $q$ relates those two questions.

This is a conceptually clear starting point to analyze
generalization. We start with an unrelated set of relevant questions
$x_i$ in which we are interested and formulate all nonlocal
information as answers to nonlocal questions. How can we know what
$q$ measures? Clearly the definition of $q$, that is the structural
information, will contain some assumptions, specifically some kind of
stationarity. The definition of $q$ is usually also based on previous
empirical information. For example, we could have tested a difference
device many times before using it and found it working correctly (or
at least indicating correctly its failure). Then the stationarity
assumption that it will work correctly also for our task seems
reasonable. (In a Bayesian approach we can give it a high subjective
probability.) Thus, structural information is based on transfer of
knowledge between tasks. Transfer means generalization with respect
to a task variable. In statistics a nonlocal question corresponds for
example to the choice of a smoothness functional to be used as prior.
Likewise, rules in an expert system with dependencies modeled by
logic, fuzzy logic or Bayesian belief networks and macroscopic
variables (e.g. energy) in physics are special cases of nonlocal
questions.

Data are pairs of question and answer and we refer to the data
related to relevant questions $x_1$, $x_2$ as test data and to data
related to questions $x_1$, $q$ with answers available as training
data. Then for independent test questions the training data
necessarily have to contain answers to nonlocal questions to allow
generalization. It is from this point of view that we want to analyze
prior information: we start with independent relevant questions and
try to give explicit explanations, namely measurable prior
information, for their dependencies. We summarize the three aspects
we have mentioned:

1. Generalization aspect: not all relevant questions 
are available for training, 

2. Application aspect: not all available data corre- 
spond directly to relevant questions, 

3. Transfer aspect: Some aspects of the setting, like 
the nature of the measuring devices, have to be 
known. 

Now let us switch to more realistic statistical problems and consider
a given function $f: y = f(x)$ with $x \in X$ and $y \in Y$ for some
set $X$ and some set $Y$. We allow as response to $x$ instead of a
deterministic $f(x)$ a random variable which takes values $y$ with
probability $p(y|x, f)$. Let us consider the following two kinds of
information: a typical training example, that is a pair $(x,y)$, and
some smoothness constraint. Let the latter be given by a bound on a
smoothness functional, like, for example, the discretized version

$$\sum_i \big(E(f(x_i)) - E(f(x_i + \Delta x_i))\big)^2 < \theta,$$

where $E(f(x))$ denotes the regression function or expectation of
$y = f(x)$ at point $x$.

The first one can clearly be interpreted as a pair of question and
answer. We see $x = q_x$ as a question for state $f$ about its value
at $x$ and denote the answer by $y = q_x(f) = f(x)$ (i.e. $f(x)$
denotes a random variable), generated by a probability distribution
$p(y|x,f)$. We write such a pair as $(x,y) = (q_x, y_{q_x})$.
Information about smoothness can be seen as answer to a more
generalized 'smoothness' question

$$q_{s,\theta} = \mathrm{sign}\Big(\sum_i \big(E(f(x_i)) - E(f(x_i + \Delta x_i))\big)^2 - \theta\Big).$$

We write this pair as $(q_{s,\theta}, y_{s,\theta})$ with
$q_{s,\theta}(f)$ equal to $\pm 1$ and remark that replacing the
fixed $\theta$ by a random variable with mean $\theta$ is one easy
way to generalize to the case of a probabilistic answer. Now note
that in contrast to a simple example data question the smoothness
question depends on more than one $x$-value of the function $f$. In
this sense the smoothness question may be seen as nonlocal.
Nonlocality is also present in symmetry questions, e.g.
$q_{p,\theta} = \mathrm{sign}\big(\sum_i (E(f(x_i)) - E(f(-x_i)))^2 - \theta\big)$
or, more generally,
$q_{s,\theta} = \mathrm{sign}\big(\sum_i (E(f(x_i)) - E(f(s x_i)))^2 - \theta\big)$,
where $s$ represents any symmetry operator under which invariance of
$f$ is tested. Also bounds for the nonlocal maximum function
$\max_x E(f(x))$ are widely used. Those three types of bounds, on
smoothness, symmetry, and maximum functionals, are the most
frequently used forms of prior information in practice.
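The smoothness and symmetry questions above are ordinary functions of several function values, which a short Python sketch makes concrete. The function names, the grid values and the tie convention (a value exactly equal to $\theta$ is mapped to $+1$) are assumptions made for illustration only.

```python
def sign(value):
    """Return +1 or -1; the tie value 0 is mapped to +1 (an arbitrary convention)."""
    return 1 if value >= 0 else -1


def smoothness_question(regression_values, theta):
    """Answer of the discretized smoothness question.

    regression_values holds E(f(x_i)) on an ordered grid, so that
    neighbouring entries play the role of x_i and x_i + Delta x_i.
    """
    roughness = sum((a - b) ** 2
                    for a, b in zip(regression_values, regression_values[1:]))
    return sign(roughness - theta)


def symmetry_question(regression, points, s, theta):
    """Answer of the symmetry question for a symmetry operator s."""
    asymmetry = sum((regression(x) - regression(s(x))) ** 2 for x in points)
    return sign(asymmetry - theta)


# A slowly varying and a strongly oscillating regression function on a grid.
smooth_answer = smoothness_question([0.0, 0.1, 0.2, 0.3], theta=0.1)
rough_answer = smoothness_question([0.0, 1.0, 0.0, 1.0], theta=0.1)

# Reflection symmetry x -> -x: x**2 is invariant, x**3 is not.
even_answer = symmetry_question(lambda x: x ** 2, [1, 2, 3], lambda x: -x, 0.5)
odd_answer = symmetry_question(lambda x: x ** 3, [1, 2, 3], lambda x: -x, 0.5)
```

With this sign convention an answer of $-1$ means that the bound is fulfilled. Both functions need several regression values at once, which is exactly what makes the questions nonlocal.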

There is one practical distinction between nonlocal 
prior information like smoothness and standard train- 
ing examples: the latter are assumed to be empirically 
measurable while the 'measurability' of nonlocal data 
like smoothness is often quite unclear. The difficulty lies 
in the fact that these nonlocal questions depend on the 
whole function f(x), and for continuous x there is an in- 
finite number of function values. But, if we want to treat 
nonlocal information in a similar way as standard train- 
ing examples we also have to discuss the measurement of 
such data. This paper shows that, besides cases where 
highly parallel measurement devices are available, also 
restrictions of measurement devices and the definition of 
situations of interest and their control can be interpreted 
as measurements of such questions. 

There is another group of nonlocal questions, for which data are more
easily available in practice. These are nonlocal questions depending
on only a finite (and in practice 'small enough') number of
$f(x_i)$. Empirical answers to these questions can be found by
measuring standard training examples $f(x)$ and applying some
operations to the results which define the questions. Examples
include differences $f(x_i) - f(x_j)$, weighted averages
$\sum_i w_i f(x_i)$ like discrete wavelet components, symmetries $s$
for specific points $f(x_i) - f(s x_i)$, and measurements of $f(x)$
with input noise. When the expectation $E(f(x_i))$ can be
approximated by $\frac{1}{n}\sum_i f(x_i)$, that is by using repeated
measurements, we can, at least approximately, also measure examples
depending on $E(f(x_i))$. Sometimes only nonlocal data are available,
as in models where the variables of interest are hidden but related
by structural assumptions to several observable variables, like for
example in hidden Markov models. However, even when the variables of
interest are observable there can be measurement devices which
measure finite nonlocal questions with greater accuracy than by using
the measuring devices for $f(x)$. Specifically, differences are often
measured better directly than by first looking for $f(x_i)$, then for
$f(x_j)$, and then taking the difference. This is always the case if
part of the measurement error appears in a final step common to all
measuring devices, like an output scale with fixed finite resolution
or memory errors. We also remarked that the definition of a
generalized question is called a rule in expert systems. Thus,
available rules collected from experts may be incorporated as
nonlocal questions. Also, many examples of measuring nonlocal
questions can be found in physics. The energy function of a
macroscopic system is highly nonlocal with respect to the microscopic
constituents. It seems that nonlocal data are often available;
however, in many cases their practical combination with priors of
infinite nonlocality is probably hindered by the difficulties of the
Frequentist approach in statistics in dealing with nonlocal
questions. That means that for data with infinite nonlocality the
measurement problem represents an empirical difficulty, while
including complex forms of nonlocal data, even with finite
nonlocality, easily leads to mathematical difficulties related to
solving nonlinear equations.
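The remark on measuring differences directly can be illustrated with an output scale of fixed resolution. The following Python sketch shows one instance only; the function name `quantize`, the resolution of one unit and the two function values are hypothetical.

```python
def quantize(value, step=1.0):
    """Read a value off an output scale with fixed resolution `step`."""
    return step * round(value / step)


f_xi, f_xj = 1.4, 0.6                  # hypothetical true values f(x_i), f(x_j)
true_difference = f_xi - f_xj

# Indirect route: two separate readings, each passing through the output scale.
indirect = quantize(f_xi) - quantize(f_xj)

# Direct route: a difference device whose single result passes through the scale.
direct = quantize(f_xi - f_xj)

indirect_error = abs(indirect - true_difference)
direct_error = abs(direct - true_difference)
```

Here the indirect route reads both values as 1 and reports a difference of 0, while the direct device reports 1. The direct reading passes through the common final step only once, so its error cannot exceed half the resolution.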

The paper formulates a theoretical approach in which 
answers to generalized questions are treated similarly to 
standard training examples. In Section 2 the theoretical 
framework is presented, paying particular attention to 
the generalization problem. It is clarified that not the 
(standard training) data, but the nonlocal (prior) data 
enable learning. 

Section 3 presents generalized questions in more de- 
tail. The importance of nonlocal information for learning 
makes it necessary to discuss the fundamental measure- 
ment problem for nonlocal questions. For example, re- 
strictions of measurement devices correspond to nonlocal 
measurements. 

Section 4 relates the classical approximate symmetry 
priors (e.g. smoothness) to nonideal measurement de- 
vices, the specific restriction being input noise. 

Usually learning problems are defined and controlled by man.
Consequently, nonlocal dependencies may be caused by internal human
concepts. In addition, experts may contribute knowledge in the form
of verbal statements. Section 5 discusses principal variants of
including subjective priors (which is meant to include priors caused
by subjects), like an interface with fuzzy priors. The section does
not aim to give an overview of this active and growing area of
research, but presents some principles and aims at explaining the
origin of nonlinearities which appear as technical difficulties in
Section 9.

Section 6 extends the theoretical framework of Section 
2 to decision problems using the language of generalized 
questions. Parallel and inverse settings are contrasted. 

Section 7 contains a discussion of the Bayesian approach,
concentrating on the saddle point approximation, leading to the two
step MaP-MiR procedure.

Section 8 discusses the Frequentist approach of empir- 
ical risk minimization from the Bayesian decision theo- 
retic point of view. Especially the relation to the max- 
imum posterior approximation in approximation and 
non-approximation situations is discussed in detail. 

Section 9 shows the possible use of generalized in- 
formation by discussing maximum posterior variational 
(mean field) equations for several variants of nonlinear 
regularization procedures. In general those have the
form of inhomogeneous integro-differential equations,
similar to those appearing, for example, in a time-independent
formulation of quantum mechanical scattering theory.

Finally, Section 10 exemplifies the ideas of nonlinear 
regularization on a numerical example. 

2 A constructive Bayesian approach 

2.1 Model and vocabulary 

In this section we choose a Bayesian approach (see for example
Berger, 1985; Haussler, 1995; Bishop, 1995a; Wolpert, 1996a) and
attempt to separate in a clear manner the local from the nonlocal
parts in our model.

Therefore, we begin with the construction of the local components: We
enumerate a set of basis questions $x$ necessary to determine the
possible states we are interested in. The term constructive means
that we do not start with a model given by some chosen
parameterization (see Wolpert, 1994a); instead, we use the relevant
questions we are interested in to construct the related model. For
every question we give a set of possible answers together with a set
of question specific answer distributions $p(y|x, f_x^0)$. Each such
answer distribution represents one possible pure local state $f_x^0$.
Its index $x$ allows the local states to be defined independently and
individually for every basis question $x$. In this paper a pure state
is indicated by a superscript 0. We assume nature to be in exactly
one pure state.$^{1}$ Which one is usually unknown to us. Thus, in
contrast, an $f_x$ without the superscript denotes not a real pure
state but a state of knowledge, being equivalent to an assignment of
a probability $p(f_x^0|f_x)$ to every pure state, where we often skip
$f_x$ in the notation. In a learning situation we assume the actual,
usually unknown, pure state (of nature) to be constant$^{2}$ for the
time under study, while the state of knowledge, which reflects the
learning process, changes with our information.

The independently defined local components are related by nonlocal
states. A pure nonlocal state $f^0$ is defined through an assignment
of one local state $f_x^0$ to every basis question $x$, and a
nonlocal state of knowledge $f$ through an assignment of a
probability $p(f^0)$ to every pure nonlocal state, with $p(f^0)$
implicitly understood as $p(f^0|f)$. The probabilities $p(f^0)$
contain all the relations between local components and therefore the
nonlocal information.

We now define the ingredients of the theoretical ap- 
proach more formally: 

1. A local basic model consisting of

a. a (finite$^{3}$) set $X$ of basis questions $x$,

b. a (measurable$^{4}$) space $Y_x$ of possible answers $y_x$ for
every question,$^{5}$

c. a (finite$^{6}$) set of probability distributions (local
elementary functions) over answers $F_x^0 = \{p(y|x, f_x^0)\}$ for
every question, the pure local states,

d. a set $F_x$ of local states of knowledge $f_x$ defined by
assigning a probability $p(f_x^0)$ to every local pure state $f_x^0$.

1 Thus, a pure state is maximally specified. We allow here a pure
state to be probabilistic, so cases are possible where one pure state
has the same probability distribution as a mixture of other pure
states. In contrast to such an equivalent mixture a pure state is a
fixed point under arbitrary learning.

2 Constant is meant relative to a chosen reference system, which
might also, for example, be varying in time.

3 For infinite $X$ the corresponding functional integration
represents a stochastic process in the language of mathematics or a
field theory in the language of physics and must be consistently
defined for every finite subset. Despite the existence of interacting
field theories in physics, only functional integrals with Gaussian
measures are mathematically well defined objects (Gardiner, 1990; van
Kampen, 1992; Schervish, 1995). Non-Gaussian functional integrals
have to be defined for example by perturbation theory using the
Feynman-Kac formula or by discretization, i.e. an ultraviolet cutoff
(Glimm-Jaffe, 1987; Zinn-Justin, 1989; Bialek, Callan, Strong, 1996;
Balasubramanian, 1996). For the use of Gaussian processes especially
in Bayesian statistics see e.g. Williams, Rasmussen, 1996; Barber,
Williams, 1997; Neal, 1997.

4 In the mathematical sense.

A local state $f_x^0$ can be considered indexing a vector
$p(y|x,f_x^0)$ in the linear space $\mathcal{F}(Y, x)$ of functions
defined on $Y = Y_x$, i.e. $f_x^0 \in \mathcal{F}(Y,x)$, with norm
$|f_x^0| = \sum_y p(y|x,f_x^0) = \sum_y p(y|x,f^0) = 1$. A state of
knowledge $f_x$ is in the convex hull of $F_x^0$, with
$\sum_{f_x^0} p(f_x^0) = 1$. We denote by $\mathcal{C}(V)$ the convex
hull of a set $V$ of vectors $v_i$ generated by linear combinations
$\sum_i a_i v_i$ with nonnegative coefficients fulfilling
$\sum_i a_i = 1$ (e.g. $v_i = f_x^0$, $a_i = p(f_x^0|f_x)$), and by
$\mathcal{L}$ the linear span, generated by unrestricted linear
combinations. So we have

$$F_x^0 \subset F_x = \mathcal{C}(F_x^0) \subset \mathcal{L}(F_x^0) \subset \mathcal{F}(Y, x).$$

Now we construct the nonlocal parts. 

2. Pure nonlocal states $f^0$, or, more shortly, pure states, are
defined by a set of pairs $(p(y|x, f_x^0), x)$ containing exactly one
local state for each question:
$\{(p(y|x,f_x^0), x) : f_x^0 \in F_x^0, x \in X\}$, to give
$p(y|x,f^0) = p(y|x,f_x^0)$. They form the set $F^0$ of pure states
or elementary functions, which is a subset of the linear space of
functions defined on $Y_x$, $X$, i.e.
$f^0 \in \mathcal{F}(Y_x, X)$. We implicitly understand states
(functions) as equivalence classes defined with respect to $X$ by
$[f^0]_X = [f'^0]_X \Leftrightarrow \forall x \in X, \forall y \in Y_x: p(y|x,f^0) = p(y|x,f'^0)$.

3. A nonlocal state of knowledge $f$, or, more shortly, a state of
knowledge, is defined through assigning probabilities $p(f^0|f)$, or
often shortly $p(f^0)$, to the pure states $f^0 \in F^0$ of the
model. Thus, $f$ denotes convex combinations of pure states $f^0$,
i.e.
$p(y|x, f) = \sum_{f^0 \in F^0} p(f^0)\, p(y|x,f^0)$, with
$\sum_{f^0 \in F^0} p(f^0) = 1$. The set $F$ of possible states of
knowledge (also called mixture states) $f$ of the model forms the
convex hull $F = \mathcal{C}(F^0)$ of
$F^0 \subset \bigotimes_x \mathcal{L}(F_x^0)$. By construction we
have $p(f^0|x) = p(f^0)$, stating independence between states $f^0$
and basic questions $x$. A state of knowledge is called factorial
with respect to $X$ iff $p(f^0) = \prod_x p(f_x^0)$,
$\forall f^0 \in F^0$. It is fully determined by its local
probabilities $p(f_x^0)$ assigned to each local state for question
$x$.$^{7}$

This paper discusses predefined dependencies between 
answers to different questions. The role of such structural 
information is discussed in section 2.3. Structural infor- 
mation is implemented by defining generalized questions. 
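The construction of pure states, states of knowledge and the factorial property can be made concrete with a small Python sketch. It assumes two basis questions with two hypothetical local pure states each (labelled `a` and `b`) and binary answers; none of the numbers come from the paper.

```python
from itertools import product

# Hypothetical local pure states: answer distributions p(y | x, f_x^0).
local_states = {
    "x1": {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.2, 1: 0.8}},
    "x2": {"a": {0: 0.7, 1: 0.3}, "b": {0: 0.4, 1: 0.6}},
}
questions = sorted(local_states)

# A pure nonlocal state assigns exactly one local state to every question.
pure_states = list(product(*(sorted(local_states[x]) for x in questions)))


def predictive(p_f0, x, y):
    """p(y | x, f) = sum over f0 of p(f0) p(y | x, f0)."""
    i = questions.index(x)
    return sum(p * local_states[x][f0[i]][y] for f0, p in p_f0.items())


def local_marginal(p_f0, i):
    """Local probabilities p(f_x^0) for the i-th basis question."""
    marginal = {}
    for f0, p in p_f0.items():
        marginal[f0[i]] = marginal.get(f0[i], 0.0) + p
    return marginal


def is_factorial(p_f0, tol=1e-12):
    """Check p(f0) = prod_x p(f_x^0) for every pure state f0."""
    marginals = [local_marginal(p_f0, i) for i in range(len(questions))]
    for f0 in pure_states:
        prod = 1.0
        for i, label in enumerate(f0):
            prod *= marginals[i].get(label, 0.0)
        if abs(p_f0.get(f0, 0.0) - prod) > tol:
            return False
    return True


factorial_state = {f0: 0.25 for f0 in pure_states}
correlated_state = {("a", "a"): 0.5, ("b", "b"): 0.5}
```

Both states of knowledge have the same local probabilities and hence the same answer distribution for every single basis question. They differ only in the nonlocal information carried by $p(f^0)$, which `is_factorial` detects.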



5 The set $Y_x$ can be assumed $x$-independent without loss of
generality, so we will usually write $Y_x = Y$.

6 The case of an infinite space $F_x$ requires the definition of a
measure $df_x$.

7 $\sum_{f_x^0 \in F_x^0} p(f_x^0) = 1$, $\forall x \in X$ ensures
$\sum_{f^0 \in F^0} p(f^0) = 1$.



4. A generalized question is any probabilistic functional $q(f)$ of
$f$. A probabilistic functional $q$ is a probabilistic mapping from
the space $F$ of functions $f$ to a space $Y_q$ of answers
$y = q(f)$ given by probability distributions
$p(y|q, f) = \sum_{f^0 \in F^0} p(f^0)\, p(y|q,f^0)$ defined by
$p(y|q,f^0)$. We require the $f^0$-dependency of $p(y|q, f^0)$ to be
expressible in terms of only the $p(y|x, f^0)$.$^{8,9,10}$ We define
some set $Q$ of generalized questions to be measurable (or
observable) by assuming, for every question in $Q$, the existence of
a measuring device producing the corresponding answer. As consistency
condition we require any question depending on a finite number of
answers to measurable questions also to be measurable. We call the
set of $x \in X$ on which the $q \in Q$ depend the basis$^{11}$ of
$Q$ and denote it by $X^Q$. Skipping the unobservable elements
$x \notin X^Q$ we write $X = X^Q$. Data $D$ are pairs of questions
and corresponding answers $(q, y_q)$. In section 3.1 we will show how
to write generalized questions explicitly, and section 3.2 discusses
the measurement processes.

For applications we use a decision theoretic framework 
(see also Section 6) and separate the two subsets 

5. $Q^D \subset Q$ of available or training questions for which
answers are available.$^{12}$ The basis of $Q^D$ will be denoted by
$X^{Q^D} = X^D \subset X$. $D_{Q^D}$ denotes the set of all possible
data which only depend on questions $q \in Q^D$. We denote by $q^D$,
and analogously for other sets of questions, vectors with components
$q_i^D \in Q^D$. Here, $q_i^D = q_j^D \in Q^D$, $i \ne j$ is also
allowed.$^{13}$ By $q^D \in Q^D$ we mean that $q_i^D \in Q^D$ for
every component of the vector. If a vector appears in $Q^D$ we
understand not the vector but its components to be elements. A data
vector $D$ is a question-answer pair $(q^D, y^D)$. We may decompose
$D$ into lower dimensional vectors
$D_i = (q_i^D, y_i^D) \subset D_{Q^D}$, $i \in I$.

8 Aside from the possibility that a functional is not defined for a
specific $f^0$, there exists the possibility of a functional having
only a certain probability of being not defined if applied to a
function $f^0$, like $q(f^0) = 1/y$ if $p(y = 0|x, f^0) \ne 0$.
Formally one can add the value 'undefined' to the space $Y$, as is
common in programming.

9 One could allow a dependence of the definition of $q$, i.e. the
$p(y|q,f,f^0)$, on the state of knowledge $f$ (which is known, as the
name indicates), i.e. on the $p(f^0|f)$. However, this is only
important when studying the dependence on $f$. A dependence
$p(f^0|q)$ would mean that selecting $q$ already changes the hidden
variable $f^0$. By definition of the model we have
$p(y|x, f, f^0) = p(y|x, f^0)$ and $p(f^0) = p(f^0|x)$ for the basic
questions.

10 Compare for example with similar concepts in Ratsaby & Maiorov,
1996 and Smola & Scholkopf, 1997.

11 This is not a linear basis of a vector space (compare Jeffrey,
1968).

12 This is related to learning (see 11., below). The set $Q^D$
depends on what part of information is considered part of the prior.
A prior $f$ can be expressed depending on some data $D$ and another
prior $f'$ according to
$p(f^0|D) = p(f^0|f(D,f')) = p(f^0|f(D, f'(D', f''))) = p(f^0|D, D')$.
Note that here the sloppy notation $p(f^0|D) = p(f^0|D, D')$ refers
to different priors on the right and left hand side which are not
explicitly indicated.

13 In as far as no linear structure is needed $q$ can also be seen as
a set, with dummy indices for repeated elements.

6. $Q^l \subset Q$ of relevant or test questions defining the
possible application situations. The basis of $Q^l$ is denoted by
$X^{Q^l} = X^l \subset X$, $D_{Q^l}$ the set of all possible data
depending on $Q^l$, and we use $D^l$ to denote a test data
vector,$^{14}$ i.e. a set of pairs $(q^l, y_{q^l}) \in D_{Q^l}$.

The set $Q^l$ is called relevant, because

7. we assume a given loss function $l(q^l, y_{q^l}, z)$, defined for
all $q^l \in Q^l$, i.e. application situations. The loss function
depends on the question $q^l$, the answer $y_{q^l}$ and potentially
also on additional variables $z \in Z$. We will call $z$ the action
variables as we usually allow them to be controlled.

We include the possibility to control the action variables $z \in Z$
within the loss function by an active choice of an action state
$\hat f \in \hat F$. Thus, we define

8. a family $\hat f \in \hat F$ of possible action states producing
the probabilistic action $z \in Z$ according to
$p(z|q^l, y, \hat f)$.

We write $l(q^l, y, \hat f)$ for the effective loss if the $z$
variables in $l(q^l, y, z)$ can be integrated out.

The possible application situations, i.e. the relevant or test
questions $q^l$, are generated by

9. a test question $q^l$-producing device $p(q^l|y_c, z_c)$ which,
conditioned on a subset $c$ of 'past' values of $y_c \in Y$ and
$z_c \in Z$, does not depend on $f^0$, $f$ or $\hat f$.

The probability distributions $p(y|q, f^0)$, $p(q|y_c, z_c)$, and
$p(z|q, y, \hat f)$ define an $\hat f$-dependent loss distribution
$p(l|f, \hat f)$:$^{15}$

$$p(l|f, \hat f) = \int dq\, dy\, dz\; p(q|y_c, z_c)\, p(y|q, f)\, p(z|q, y, \hat f)\, \delta\big(l(q, y, z) - l\big).$$

The normative component is represented by the requirement to minimize

10. a risk functional $r[p(l|f, \hat f)]$, which is a mapping from
the loss distribution $p(l|f, \hat f)$ into a subset of the real
numbers, bounded from below. A common risk functional is the
expectation of $l$ or expected risk
$r(f, \hat f) = \int dl\; p(l|f, \hat f)\, l$.$^{16}$

New available data $D$ require updating of an initial state of
knowledge $f'$ to obtain a new $f$.



14 Not to be confused with the validation set of empirical test data
as it is used in cross-validation (see Section 8.4).

15 See Section 3 and Section 6 for details and justification of the
chosen components and their probability distributions.

16 Minimizing an expected loss was already proposed by (Laplace,
1810ab) and later revived by (Wald, 1938) (see the historical remarks
in (Le Cam, Yang, 1990)). A general formalization can be found in
(Le Cam, 1986).



11. A learning model is a mapping $f = f(D, f')$ from $F$ to $F$
parameterized by $D$. The initial state will be called prior state.
The Bayesian learning model is defined by

$$p(f^0|f(D, f')) = \frac{p(y^D|q^D, f^0)\, p(f^0|f')}{p(y^D|q^D)}.$$

We will from now on skip $f'$ in the notation and write
$p(f^0|f(D, f')) = p(f^0|D)$ and $p(f^0|f') = p(f^0)$. We can write
this in the form

$$p(f^0|D) = \frac{\sum_{f'^0} T^D(f^0, f'^0)\, p(f'^0)}{\sum_{f''^0} \sum_{f'^0} T^D(f''^0, f'^0)\, p(f'^0)},$$

or shortly, in matrix notation

$$P_D = \frac{T P}{\mathrm{Tr}\, T P},$$

with $P(f'^0, f''^0) = p(f'^0)$, $P_D(f'^0, f''^0) = p(f'^0|D)$ and
Tr denoting the trace. The so defined matrix

$$T^D(f^0, f'^0) = \delta_{f^0, f'^0}\; p(y^D|q^D, f^0)$$

is diagonal in $f^0$-representation. This shows that the pure states
$f^0$ represent the possible fixed points of learning.$^{17}$

Section 2.2 discusses learning in factorial states. 
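Because $T^D$ is diagonal in $f^0$-representation, the Bayesian learning model of item 11 reduces to an elementwise reweighting followed by normalization with the trace. The following Python sketch uses three hypothetical pure states with assumed likelihood values $p(y^D|q^D, f^0)$ and checks the fixed point property numerically.

```python
def bayes_update(p_f0, likelihood):
    """Return p(f0 | D) from p(f0) and the diagonal entries of T^D.

    p_f0[i]       prior probability of the i-th pure state
    likelihood[i] p(y^D | q^D, f0) for that state (diagonal of T^D)
    """
    weighted = [t * p for t, p in zip(likelihood, p_f0)]   # T P (one column)
    trace = sum(weighted)                                  # Tr T P = p(y^D | q^D)
    if trace == 0.0:
        raise ValueError("data impossible under the given state of knowledge")
    return [w / trace for w in weighted]


likelihood = [0.9, 0.5, 0.1]          # hypothetical values for three pure states

# A mixed state of knowledge is changed by the data ...
mixed_after = bayes_update([0.5, 0.3, 0.2], likelihood)

# ... while a pure state is a fixed point of learning.
pure_after = bayes_update([0.0, 1.0, 0.0], likelihood)

# With degenerate diagonal entries a non-pure state can stay unchanged as well.
degenerate_after = bayes_update([0.5, 0.5, 0.0], [0.4, 0.4, 0.1])
```

The last line corresponds to the degenerate case mentioned in the footnote: if the data cannot distinguish the pure states that carry probability, the mixture is left unchanged.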

The local basic model including the local probabilities $p(f_x^0)$
represents the local part of prior information, the definition of
generalized questions the structural part of prior information. Thus,
a factorial state has only local and structural prior information and
in any state of knowledge the nonlocal, nonstructural information
should be information resulting from measurement or control added to
a factorial starting state. After clarifying the importance of
nonlocal information we will discuss its possible measurement.
Table 1 summarizes some of the notations.

Examples of basis questions 

The basis questions $p(y|x, f^0)$ define the answer probabilities for
the available data $Q^D$ and relevant questions $Q^l$. Their
definition is therefore not independent of the available measurement
devices for $Q^D$ and $Q^l$, and they cannot be chosen arbitrarily.
Choosing a set of $p(y|x, f^0)$ is part of the local prior knowledge.
Being mainly interested in generalization, we do not concentrate here
on the local part. We assume a formulation of the problem with a
local prior $p(f_x^0)$ where the local part is learnable, i.e. where
the local state of knowledge $p(f_x^0|f_x(D_x))$ asymptotically
converges to one $f_x^0$ under local questions $x$, according to some
convergence criteria of our choice. This is without loss of
generality, because we could always consider questions about the
possible densities $p(y|x, f^0)$, i.e. study the density
approximation problem. Practically, this does not change the
situation, but splits formally the local part into several parts, so
the


Notations

$x \in X$                 basis questions
$q \in Q$                 generalized questions
$y \in Y$                 (probabilistic) answers
$\hat y$                  (probabilistic) actions
$z$                       internal variables
$D = (D_i, D_y)$          data
$f^0 \in F^0$             pure states
$f \in F$                 states of knowledge
$\hat f \in \hat F$       action states
$p(y|q, f^0)$             probability (or density) for answer $y$
                          under question $q$ in state $f^0$
$p(z|x, y, q)$            probability (or density) for internal $z$
                          given data $(x, y)$ in question $q$
$p(\hat y|q, \hat f)$     probability (or density) for action $\hat y$
                          under question $q$ in action state $\hat f$
$p(\hat q|y, \hat f)$     same in inverse model for action $\hat q$
$p(f^0|f)$                probability of $f^0$ under $f$
$y_x(f^0)$                regression function at $x$ in state $f^0$
$L = \ln p$               log-probability
$l(q, y, z)$              loss function
$l(q, y, \hat f)$         loss function integrated over probabilistic
                          action or for deterministic action
$r$                       risk functional

Table 1: Some notations frequently used in this paper



17 Notice that if $T$ has degenerate eigenvalues, non-pure states
may also be unchanged. See Theorem in Section 2.2.



problem appears as a nonlocal one.$^{18}$ Also, in a possible loss
function for density

Also, $x$ need not necessarily be a single minimal component, but we
can combine many $x$ (with e.g. previously learned dependencies) into
a larger $x$ vector. Technically, one $x$ just denotes one
independently parameterized subset of $F^0$, and the question of
generalization is the question of generalization between such sets.

Basis questions can be Gaussian

$$p(y|x, f^0) \propto e^{-\frac{(y - y_x(f^0))^2}{2\sigma_x^2}}$$

so that states $f^0$ are parameterized by their regression function
$y_x(f^0)$. In general, the parameterization of the $p(y|x, f^0)$
(and therefore of the states) can be arbitrary, e.g. also the
variance $\sigma_x^2$ or higher order moments can be
$f^0$-dependent.
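A Gaussian basis question can be written down directly. In the Python sketch below a pure state is represented by its regression function, stored as a dictionary from questions to mean answers. The representation, the names and the numbers are assumptions chosen for illustration, and the variance is taken to be state independent.

```python
import math


def gaussian_answer_density(y, x, state, sigma_x=1.0):
    """Density p(y | x, f0) of a Gaussian basis question.

    state maps every basis question x to the regression value y_x(f0);
    sigma_x is the standard deviation of the answers to question x.
    """
    mean = state[x]
    normalization = sigma_x * math.sqrt(2.0 * math.pi)
    return math.exp(-(y - mean) ** 2 / (2.0 * sigma_x ** 2)) / normalization


# Two hypothetical pure states differing in their regression functions.
state_1 = {"x1": 2.0, "x2": -1.0}
state_2 = {"x1": 0.0, "x2": -1.0}

# An observed answer y = 2 to question x1 is more probable under state_1.
density_1 = gaussian_answer_density(2.0, "x1", state_1)
density_2 = gaussian_answer_density(2.0, "x1", state_2)
```

The two states give identical densities for question `x2`, so answers to `x2` alone cannot distinguish them.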

Consider as a more complex example image y produc- 
ing states (generative models), e.g. with x having the 
values face and non-face. States /° are defined by their 
generation probabilities for images of faces p(y|face,/°) 
and of non-faces p(y\ non-face, /°). Generation of faces 
in a state /° could be defined 

p(2/|face,/°) = / dvp(v\fa,ce,f )p(y\v,fa,ce,f ), 

with v being an index for the different variants of a face. 
Possible variants include for example interpersonal dif- 
ferences, varying view points or changing illumination 
conditions. Using some interpolation scheme, like op- 
tical flow and correspondence of some reference points, 
a continuous v could be constructed out of a discrete 
set of examples. Also, human prior knowledge may be 
that faces have constituents j like two eyes, mouth and 
nose appearing and being combined in different variants. 
In the easiest version with independent constituents one 
could choose 






$$p(y|v, \text{face}, f^0) \propto e^{-\sum_{i,j} \left(d_i^{v,j}(y_i)\right)^2}$$

using some distance $\left(d_i^{v,j}(y_i)\right)^2 = \|y_i - y_i^{v,j}\|^2$, with $i$ being the pixel index and $y_i^{v,j} = y_i^{v,j}(f^0)$ being a template for variant $v$ for constituent $j$. Then the $p(v|\text{face}, f^0)$ parameterize the face states.

2.2 Analysis of generalization 

2.2.1 Minimal models and sufficient data 

The ability to generalize is essential for any real learning. It is easy to see that generalization requires nonlocal dependencies contained in the $p(f^0|f)$. Let us denote by $D_{X\setminus x}$ data which do not depend on $x$ and combine all $f^0_{x'}$ for $x' \ne x$ into $f^0_{X\setminus x}$. Then using $\sum_{f^0_{X\setminus x}} p(f^0_{X\setminus x}|D_{X\setminus x}) = 1 = \sum_{f^0_{X\setminus x}} p(f^0_{X\setminus x})$ we see that in a factorial state, where $p(f^0_x|f^0_{X\setminus x}) = p(f^0_x)$,

$$p(y|x, f'(D_{X\setminus x}, f)) = \sum_{f^0} p(y|x, f^0)\, p(f^0|D_{X\setminus x}) = \sum_{f^0_x} p(y|x, f^0_x)\, p(f^0_x|f) = p(y|x, f).$$

18 In a usual density estimation problem there is only one $x$ (with the meaning 'get the next $y$'), and, besides a local positivity restriction, a natural (nonlocal) normalization condition.



This means that data not depending on $x$ can never change the answer probabilities to $x$. We can say that a factorial state allows no inference and represents a 'tabula rasa' situation with respect to generalization. Thus, starting from a factorial state we necessarily need nonlocal information to enable any nonlocal learning. We can always relate $f = f(D, f_{fact})$ to a factorial state $f_{fact}$ by

$$p(f^0|f) = p(f^0|D) = p(D|f^0)\, p_{fact}(f^0) / p(D).$$

Now we formulate this observation in a slightly more general way and show under which conditions the conclusion can be reversed. We begin with a simple example to outline the general idea, one in which $F^0$ does not consist only of extremal points of the convex set $F$. Let a space $F^0$ with three possible pure states be defined by $p(y = 1|x, f^0_1) = 1$, $p(y = -1|x, f^0_2) = 1$, $p(y = \pm 1|x, f^0_3) = 0.5$. This may be the probability that a certain gender is marked on an application form for girls' schools, boys' schools, and coeducational schools. For example, a state of knowledge $f$ with $p(y = \pm 1|x, f) = 0.5$ can be expressed as

$$p(y|x, f) = p(f^0_1)\, p(y|x, f^0_1) + p(f^0_2)\, p(y|x, f^0_2) + p(f^0_3)\, p(y|x, f^0_3) = \frac{1-a}{2}\left(p(y|x, f^0_1) + p(y|x, f^0_2)\right) + a\, p(y|x, f^0_3)$$

for any $0 \le a = p(f^0_3) = 1 - \sum_{i=1}^{2} p(f^0_i) \le 1$. Now assume that some data not related to $x$ change the probability $p(f^0_3) = a$. Obviously, this has no influence on $p(y|x, f)$. We will call such a space $F^0$ non-minimal with respect to $x$, and the set of data $(x, y)$ not sufficient with respect to $F^0$. Now consider a space $Q_x$ of local questions $q_x$ which also includes repeated measurements of $x$. The probability for a repeated measurement $y(x), y'(x)$ is the product $p(y, y'|x, f) = p(y|x, f)\, p(y'|x, (y, x))$, which changes with $p(f^0_3)$. The probability for a measurement $(y = 1, y' = -1)$ would be zero if $p(f^0_3) = 0$, but 1/4 if $p(f^0_3) = 1$. Then the coefficients $p(f^0_i)$ which define a state $f$ are unique, and if they change, they change the probability distribution of some question $q_x$. We will say the space $F^0$ is minimal with respect to the set of data $D_x = (q_x, y_{q_x})$, and data $D_x$ are sufficient for $F^0$.
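The three-state example can be checked numerically. The following is a minimal sketch, assuming that repeated measurements are independent draws from the same (stationary) pure state; the function names are illustrative and not part of the original text.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# p(y = +1 | x, f_i) for the three pure states f_1, f_2, f_3 of the example.
P_PLUS = [Fraction(1), Fraction(0), HALF]


def weights(a):
    """Mixture weights (p(f_1), p(f_2), p(f_3)) with p(f_3) = a and p(f_1) = p(f_2)."""
    return [(1 - a) / 2, (1 - a) / 2, a]


def single(y, a):
    """p(y | x, f) for one measurement of x."""
    return sum(w * (p if y == 1 else 1 - p) for w, p in zip(weights(a), P_PLUS))


def repeated(y, y2, a):
    """p(y, y' | x, f) for two measurements of x under the same pure state."""
    total = Fraction(0)
    for w, p in zip(weights(a), P_PLUS):
        py = p if y == 1 else 1 - p
        py2 = p if y2 == 1 else 1 - p
        total += w * py * py2
    return total
```

A single measurement gives 1/2 for every admissible `a`, so it cannot distinguish the decompositions, whereas the repeated measurement `(y = 1, y' = -1)` has probability `a / 4` and therefore does.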

We define equivalence classes $f^0_{D_Q} = [f^0]_{D_Q}$ of $D_Q$-equivalent states for the set of data $D_Q$ of the form $(q, y)$ with $q \in Q$ by

$$f^0_{D_Q} = f'^0_{D_Q} \Leftrightarrow \forall (q, y) \in D_Q : p(y|q, f^0) = p(y|q, f'^0),$$

forming the set $F^0_{D_Q}$. In the case in which the data contain all $y \in Y_q$ for every question $q \in Q$ we speak of $Q$-equivalent states and write $f^0_Q$. We defined $F^0 = F^0_X$. The same constructions can be applied to states of knowledge, yielding $f_Q$ and $F_Q$.

We define data $D_Q$ to be sufficient with respect to $F^0$, or equivalently the set $F^0$ to be minimal with respect to $D_Q$, iff all states of knowledge $f \in F(F^0_{D_Q})$ are uniquely decomposable into the $f^0_{D_Q}$, that is iff there is no solution $p(f'^0_{D_Q})$ of the following system of homogeneous linear equations

$$p(y|q, f^0_{D_Q}) = \sum_{f'^0 \ne f^0} p(y|q, f'^0_{D_Q})\, p(f'^0_{D_Q}).$$

This means that no pure state can be expressed by others. Then the corresponding system of inhomogeneous linear equations for $p(y|q, f_{D_Q})$ is overdetermined and therefore there exists at most one solution for the state $p(f^0_{D_Q})$. (At least one solution exists by construction.) This means that sufficient data determine the state of knowledge $f$ uniquely. In other words, the convex hull $F_{D_Q}$ of $F^0_{D_Q}$ does not contain equivalent states, i.e.

$$[F(F^0_{D_Q})]_{D_Q} = F(F^0_{D_Q}).$$

To shorten the notation we can write for $f_{D_Q}$, introducing indices $i = (q, y)$, $j = f^0_{D_Q}$,

$$p_i = \sum_j A_{ij}\, p^f_j, \qquad p = A p^f.$$

The matrix (or integral operator with kernel $A_{ij}$) $A = A(D, F^0)$ describes the model $F^0$ with components $f^0$ on a data vector $D$ with components $(q, y)$, and $p^f$ with components $p(f^0|f)$ is the state of knowledge. Minimality requires the number of independent pure states $f^0_{D_Q}$ to be not larger than the number of data, i.e. question-answer pairs $(q, y)$, which means that the rank of the matrix $A_{(y,q), f^0_{D_Q}} = p(y|q, f^0_{D_Q})$ is equal to the number of $f^0_{D_Q}$. Summarizing, minimality of the model with respect to data, or sufficiency of the data for a model, is defined as a situation where the 'model-data matrix' $A$ has rank equal to the dimension of $p^f$, i.e. the number of $f^0_{D_Q}$.
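The rank criterion can be illustrated on the three-state example given earlier. The sketch below is an assumption-laden illustration: the matrices are written out by hand for single and for twice-repeated measurements (repeated answers taken as independent given the pure state), and `rank` is a small exact Gaussian elimination written for this purpose.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    n_cols = len(m[0]) if m else 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


half, quarter = Fraction(1, 2), Fraction(1, 4)

# Rows: answers y = +1, y = -1. Columns: pure states f_1, f_2, f_3.
A_single = [[1, 0, half],
            [0, 1, half]]

# Rows: answer pairs (+,+), (+,-), (-,+), (-,-) of a repeated measurement.
A_repeated = [[1, 0, quarter],
              [0, 0, quarter],
              [0, 0, quarter],
              [0, 1, quarter]]

n_states = 3
is_minimal_single = rank(A_single) == n_states
is_minimal_repeated = rank(A_repeated) == n_states
```

With single measurements the rank is 2, smaller than the number of pure states, so the model is not minimal; including repeated measurements raises the rank to 3.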
Minimality/sufficiency can be achieved by deleting some $f^0$ or by including more $y$ or new $q$ in the minimality condition. This can be done by including new independent questions, using a finer scale, or adding new dimensions to $y$. Examples include repeated measurements of the same $x$, with $Q_x^{(n)}$ consisting of $n$-tuples $(x_1 = x, x_2 = x, \ldots, x_n = x) = \times_{i=1}^{n} x$ and $Q_x = \bigcup_{n=1}^{\infty} Q_x^{(n)}$, and multiple measurements within $X$, with $Q_X^{(n)}$ consisting of $n$-tuples $\times_{i=1}^{n} x_i$ and $F_{Q_X^{(n)}} \subset \bigotimes_{i=1}^{n} C(F_{x_i})$, or of varying length, $Q_X = \bigcup_{n=1}^{\infty} Q_X^{(n)}$ with $F_{Q_X} \subset \bigoplus_{n=1}^{\infty} \bigotimes_{i=1}^{n} C(F_{x_i})$ [19].



19 This is similar to the construction of the Fock space for a many-particle system, except that here the space is restricted to the convex hull (constant $L_1$ norm) and not to a region with constant $L_2$ norm.



For two-component data vectors from $Q_x^{(2)}$ the probability $p(y_1, y_2|x_1, x_2, f) = p^{(2)}_{ij}$ has to fulfill

$$p^{(2)}_{ij} = \sum_k A_{ik} A_{jk}\, p^f_k = \sum_{k,l} A_{ik} A_{jl}\, p^{f(2)}_{kl},$$

and therefore $p^{(2)} = (A \otimes A)\, p^{f(2)}$ with diagonal $p^{f(2)}_{k,k'} = \delta_{k,k'}\, p^f_k$. For general data vectors, $p^{(n)} = (\otimes^n A)\, p^{f(n)}$ with $p^{f(n)}_{k,k_1,\ldots,k_{n-1}} = \prod_{i=1}^{n-1} \delta_{k,k_i}\, p^f_k$.

A model minimal for single measurements is minimal for multiple measurements. For a minimal model the $f^0$ are linearly independent, and in case the number of data (equations, conditions) is larger than the number of $f^0$ there must exist a reduced $n \times n$ system with nonzero determinant, so the solution is unique. Thus, we choose a decomposition of $A$ into blocks $A'$ and $A''$, with $A'$ a square $n \times n$ matrix so that its determinant is defined, and for a minimal model there exists an $A'$ with $\det A' \ne 0$. The relation $\det(A' \otimes A') = (\det A')^{2n}$ for $n \times n$ matrices shows that the determinant for multiple measurements for the reduced system $A'$ is nonzero if it is nonzero for single measurements. If a given solution is already unique for a reduced system, it is also unique in an extended system where only more conditions, consistent with the solution, are added. Thus, models minimal for single measurements $Q_x^{(1)}$ are also minimal for multiple measurements, i.e. vectors $Q_x^{(n)}$.
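The determinant relation used in this argument is a special case of the standard Kronecker-product identity $\det(A \otimes B) = (\det A)^m (\det B)^n$ for an $n \times n$ matrix $A$ and an $m \times m$ matrix $B$. A small exact check, with arbitrarily chosen illustrative matrices and helper functions written only for this sketch:

```python
from fractions import Fraction


def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return Fraction(m[0][0])
    total = Fraction(0)
    for j, v in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * Fraction(v) * det(minor)
    return total


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


# Illustrative matrices: A is 2 x 2 (n = 2), B is 3 x 3 (m = 3).
A = [[2, 1],
     [1, 3]]
B = [[1, 2, 0],
     [0, 1, 1],
     [1, 0, 2]]

det_AB = det(kron(A, B))   # expected to equal det(A)**3 * det(B)**2
det_AA = det(kron(A, A))   # expected to equal det(A)**(2 * 2)
```

Since the right-hand side is a product of powers of the individual determinants, the Kronecker product is invertible exactly when both factors are.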

Non-minimal local spaces $F^0_x$ are not commonly used. A local space consisting of Gaussians at different locations, for example, is minimal. A convex linear combination is then not Gaussian but a Gaussian mixture state.

2.2.2 Factorial priors 

Let us now consider two sets of questions, $Q^I$ (e.g. relevant questions) and $Q^D$ (e.g. training questions), with corresponding (test) data $D^I$ and (possible sets of training) data $D = \{D_i, i \in I\}$. The $D_i = (y_i, q_i)$ are allowed to be data vectors and may represent one possible collection of data which can be obtained during training.

In a minimal model the following theorem states that the prior probabilities already reflect the possibility or impossibility of generalization:

Theorem: For a set of (test) data $D^I \subset D_{Q^I}$ sufficient for $F^0_I = F^0_{Q^I}$ (or equivalently for $F^0_I$ minimal with respect to $Q^I$) and another set of (potential training) data $D \subset D_{Q^D}$ sufficient for $F^0_D = F^0_{Q^D}$ (or equivalently $F^0_D$ minimal with respect to $D$) the following proposition holds:

$$\forall (q, y) \in D^I,\ \forall D_i \in D : p(y|q, f) = p(y|q, f'(D_i, f))$$
$$\Leftrightarrow\ \forall f^0_I, f^0_D : p(f^0_{I,D}) = p(f^0_I, f^0_D) = p(f^0_I)\, p(f^0_D),$$

where $f^0_{I,D} = f^0_{I \cup D} = [f^0]_{D_{Q^I} \cup D}$. The backward direction does not require the two sufficiency conditions.



The theorem gives conditions under which conditional independence of all relevant data from all training data is equivalent to independence between all $f^0_I$ and $f^0_D$. The stated conditional independence also means that the actual state of knowledge is an eigenstate for all matrices $T_{D_i}(f^0, f'^0) = \delta_{f^0, f'^0}\, p(y^{D_i}|q^{D_i}, f^0)$, or that all mutual informations

$$\ln \frac{p(y|q, f, D_i)}{p(y|q, f)} = \ln \frac{p(y, q, D_i|f)}{p(y, q|f)\, p(D_i|f)}$$

are zero, and therefore also all averages of them.

Proof: We show that factorial priors do not allow generalization, and that sufficiency of $D^I$ and $D$ excludes the stated no-generalization property for other priors.

For $q \in Q^I$, abbreviating $p(y|q, f'(D_i, f))$ by $p(y|q, D_i)$, we write for the no-generalization condition

$$p(y|q, f) = p(y|q, D_i) = \sum_{f^0} p(y|q, f^0)\, p(f^0|D_i)$$
$$= \sum_{f^0_I} \sum_{f^0_D} p(y|q, f^0_I)\, p(f^0_I|D_i)\, p(f^0_D|f^0_I, D_i) = \sum_{f^0_I} p(y|q, f^0_I)\, p(f^0_I|D_i),$$

because for $q \in Q^I$ the probability $p(y|q, f^0)$ only depends on $f^0_I$ and $\sum_{f^0_D} p(f^0_D|f^0_I, D_i) = 1$. Another summation over finer classes up to $D_X$ is not necessary because $\sum_{f^0_{I \cup D \cup D_X}} p(f^0_{I \cup D \cup D_X}|f^0_{I \cup D}) = 1$. Setting $p(f^0_I|D_i) = p(f^0_I)$ yields $p(y|q, f) = p(y|q, D_i)$, giving one solution of the nonlearnability condition. For a $D^I$-sufficient model the state $p(f^0_I|D_i)$ is uniquely determined by the probabilities for the relevant data $p(y|q, f^0)$, $q \in Q^I$ and $y \in Y_q$, so $p(f^0_I|D_i) = p(f^0_I)$ is also the only solution. Thus, sufficiency of $D^I$ excludes the possibility that the influence of data only consists in switching between $D^I$-equivalent states.

Now we show that independence of $f^0_I$ from the data $D_i$ together with the minimality of $F^0_D$ with respect to $D$ only allows a factorized prior probability. We insert a summation over $f^0_D$ into $p(f^0_I|D_i)$ and write for the condition $p(f^0_I|D_i) = p(f^0_I)$

$$p(f^0_I) = p(f^0_I|D_i) = \sum_{f^0_D} p(f^0_D|D_i)\, p(f^0_I|f^0_D).$$

One sees that the condition $p(f^0_I|D_i) = p(f^0_I)$ is fulfilled for $p(f^0_I|f^0_D) = p(f^0_I)$. This already solves the backward direction without the need of any minimality or sufficiency condition. For a model not minimal on $F^0_D$ there still might be different states on $D$ leading to the same posterior $p(f^0_I|D_i)$ on $F^0_I$. We now use sufficiency of the data $D$ to exclude the possibility that there are dependencies which cannot be explored by $D$. The probability $p(f^0_D|D_i)$ is related to the definition of the training questions, i.e. to the $p(y_i^D|q_i^D, f^0_D)$, by

$$p(f^0_D|D_i) = \frac{p(y_i^D|q_i^D, f^0_D)\, p(f^0_D)}{\sum_{f^0_D} p(y_i^D|q_i^D, f^0_D)\, p(f^0_D)}.$$

Inserting this equation into the above equation for $p(f^0_I|D_i)$ shows that, according to the assumption of sufficient data $D$, the coefficient vector $a(f^0_D)$ multiplying the matrix $A_{i, f^0_D} = p(y_i^D|q_i^D, f^0_D)$ must be unique:

$$\frac{p(f^0_I)\, p(f^0_D)}{\sum_{f^0_D} p(y_i^D|q_i^D, f^0_D)\, p(f^0_D)} = \frac{p'(f^0_I|f^0_D)\, p'(f^0_D)}{\sum_{f^0_D} p(y_i^D|q_i^D, f^0_D)\, p'(f^0_D)},$$

where $p'(f^0_I|f^0_D)\, p'(f^0_D) = p'(f^0_I, f^0_D)$ denotes another solution of the joint probability. Summation over $f^0_I$ and $f^0_D$ gives equality for the $f^0$-independent denominators on both sides, so that $p(f^0_I, f^0_D) = p(f^0_I)\, p(f^0_D)$ is the only solution, q.e.d.

Without restriction to sufficient data there might ex- 
ist spurious dependencies between equivalent states of 
knowledge, which are not observable within the given 
set of relevant questions. 

The formal structure of the Theorem and of its proof becomes clearer if written in a more abstract matrix formulation. With $i = (q, y)$, $j = (q^D, y^D)$, $k = f^0_I$, $l = f^0_D$, $A_{ik} = p(y|q, f^0_I)$, $B_{jl} = p(y^D|q^D, f^0_D) / p(y^D|q^D)$ (where for observed data the denominator is unequal to zero), we can write for the posterior $p(y|q, D)$, in components,

$$p_{ij} = \sum_{k,l} M_{ij,kl}\, p^f_{kl} = \sum_{k,l} A_{ik} B_{jl}\, p^f_{kl},$$

which reads for matrices

$$p = (A \otimes B)\, p^f.$$

Theorem (matrix formulation):

For a (sub)system of equations with invertible $n \times n$ matrix [20] $A$ and $m \times m$ matrix $B$, i.e. with $\det A \ne 0 \ne \det B$ (Minimality/Sufficiency), the following holds:

$$p_I \otimes 1 = p_I \otimes p_D = (A \otimes B)\, p^f \ \Leftrightarrow\ p^f = p^f_I \otimes p^f_D.$$

Indeed, according to

$$\det(A \otimes B) = (\det A)^m (\det B)^n,$$

with $A$ and $B$ also $A \otimes B$ is invertible, and using

$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$$

and

$$(A \otimes B)(p_I \otimes p_D) = (A p_I \otimes B p_D)$$

we find

$$p^f = (A^{-1} p_I \otimes B^{-1} p_D) = p^f_I \otimes p^f_D,$$

thus $p^f$ factorizes. Formulated for probabilities conditioned on $D$, we used $p_D = 1$ for the theorem. Formulating the theorem for joint probabilities $p(y, q, D|f)$, or similarly $p(y, y^D|q, q^D, f)$, symmetrizes the formulation and gives $A_{ik} = p(y, q|f^0_I)$, $B_{jl} = p(y^D, q^D|f^0_D)$, $(p_I \otimes p_D)_{ij} = p(y, q|f_I)\, p(y^D, q^D|f_D)$, with still $M = A \otimes B$ according to the definition of the model.

20 For simplicity we skip the prime for $A$ and $B$ which we used previously for reduced matrices.
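The two Kronecker-product identities used in this argument can be checked on a small example. The sketch below uses made-up $2 \times 2$ matrices and state vectors (all values are illustrative assumptions) and exact rational arithmetic.

```python
from fractions import Fraction as F


def kron_mat(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def kron_vec(u, v):
    """Kronecker (outer) product of two vectors, flattened."""
    return [x * y for x in u for y in v]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def inv2(m):
    """Inverse of a 2 x 2 matrix with nonzero determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical model matrices; each column is the answer distribution of a pure state.
A = [[F(1), F(1, 2)], [F(0), F(1, 2)]]
B = [[F(3, 4), F(1, 4)], [F(1, 4), F(3, 4)]]

# A factorized state of knowledge p^f = p^f_I (x) p^f_D.
pf_I = [F(1, 3), F(2, 3)]
pf_D = [F(1, 4), F(3, 4)]

p = matvec(kron_mat(A, B), kron_vec(pf_I, pf_D))
p_factored = kron_vec(matvec(A, pf_I), matvec(B, pf_D))
recovered = matvec(kron_mat(inv2(A), inv2(B)), p)
```

Here `p` equals `p_factored`, and applying the Kronecker product of the two inverses to `p` returns the factorized state, which is the uniqueness step of the argument in miniature.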
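Formulated in a basis $X$ the theorem gives the

Lemma: For $(q_x, y) \in D^I_x \subset D_x$, $D_i^{X \setminus x} \in D_{X \setminus x}$, and requiring for the forward direction sufficiency of the set $D_x$ for $F^0_x$ for all $x$ and sufficiency of the set $D_{X \setminus x}$ for $F^0_{X \setminus x}$ for all $x$, the following holds:

$$\forall x \in X,\ \forall (q_x, y) \in D^I_x,\ \forall D_i^{X \setminus \{x\}} \in D_{X \setminus \{x\}} : p(y|q_x) = p(y|q_x, D_i^{X \setminus \{x\}})$$
$$\Leftrightarrow\ \forall f^0 \in F^0 : p(f^0) = \prod_{x \in X} p(f^0_x).$$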

Remarks: 



1. The theorem states that a factorial state remains factorial after local learning and that no generalization is possible across a 'factorization border' for any learning algorithm. In factorial states any information concerning answers to question $x$ can only come from questions depending explicitly on the point $x$ of interest, but not necessarily depending only on $x$. Answers to questions depending only on $X \setminus X^D$ cannot be learnt under a factorial prior, and training data not depending on $X^I$ are uninformative with respect to relevant questions if not combined with other information. They can thus only have indirect influence. For analysis of prior information in terms of relevant questions we may choose $X = Q^I$. Then in order to enable learning for every $q^I = x$ for every $x$ we must have data $D$ depending (maybe not only) on that $x$. For example, a smoothness constraint in statistics depends on all $x$ and can allow generalization. Especially easy to analyze is the situation when $Q^D \subset X$. Then all relevant questions are defined with respect to the data. In this case a factorial prior on $X$ refers to the training questions themselves and the relevant questions are directly defined in their dependency on the training questions. We discuss the relations in detail in the next subsection.

2. The forward direction, factorial state $\Rightarrow$ no generalization, is related to so-called 'No Free Lunch' theorems (Wolpert, 1996a, 1996b), generalized from uniform priors (or uniform meta priors) to factorial states, without explicitly referring to a specific form of loss function or algorithm. Indeed, uniform priors of the form $p(f^0) = p(f'^0)$, $\forall f^0, f'^0$, possibly also with uniform probabilities $p(f^0_x) = p(f'^0_x)$, $\forall f^0_x, f'^0_x$ within single $x$-components so that $p(f^0) = p(f'^0)$, $\forall f'^0$, are special factorial priors. With respect to nonlearnability of specific sets of questions the theorem is sharper than results from the theory of uniform convergence (Vapnik, 1982) or regularization theory (Tikhonov, 1963), and independent of a specific loss function. However, it does not give quantitative results. Also, a change of $p(y|q, f)$ does not necessarily change the decision $f$ (see 4.). A quantitative measure of dependency can be obtained by averaging the mutual information over a distribution for data $D_i$ and $D^I$, or by integrating out $D^I$ and calculating an average change of the risk under a data distribution $p(D_i)$.

3. The backward direction gives sufficient conditions 
for a no-generalization state to be factorial. The 
main point is that if the model has only one way 
to express the same state then it cannot switch be- 
tween equivalent descriptions and there is no possi- 
bility to create formal dependencies without visible 
effects. 

4. A change of $p(y|q, f)$ does not have to change the final decision $f$. It may be that

a. the loss function is not sensitive to a change in $p(y|x, f)$,

b. the risk, i.e. a certain property of $p(l|f, f^0)$ like its expectation, is not sensitive to a change in the loss,

c. the decision is not sensitive to a change in the risk.

Of principal interest are the optimality equivalence classes

$$[f^0]_{r^*} = [f'^0]_{r^*} \Leftrightarrow f^* = \mathrm{argmin}_{f \in F}\, r(f^0, f) = \mathrm{argmin}_{f \in F}\, r(f'^0, f),$$

which identify $f^0$ leading to the same decision $f^*$. While this already requires calculation of the optimal decision, it may be easier to calculate

$$[f^0]_r = [f'^0]_r \Leftrightarrow \forall f : r(f^0, f) = r(f'^0, f).$$

However, for both variants, available data may have differing probabilities for different $f^0$ within such classes, so that $p(f^0|D)$ can vary within a class. Thus, analogous to $f^0_{I,D}$ in the theorem, one has to consider finer equivalence classes

$$[f^0]_{r^*,D} \ \text{or}\ [f^0]_{r,D},$$

where only data $D^f$ which factorize, like $p([f^0]_{r,D}) = p([f^0]_{r,D \setminus D^f})\, p([f^0]_{D^f})$, can be skipped. All those equivalence classes can be finer than $[f^0]_{r^*}$, so this does not guarantee that data changing $p(f^0)$, i.e. the state of knowledge $f$, change the decision $f^*$.

5. Pure states are special factorial states. In this sense factorial states are the possible starting and ending points of a learning process. For $x$-variables which are visible and known, $f^0$ can be seen as a collection of the remaining (stationary distributed) random variables. Making part of these random variables $f^0$ visible, that is moving them into $x$, breaks the old pure states into new pure components. The learning process would ideally go from the old pure state, now assumed to be incompletely specified and therefore only a factorial state of knowledge, to one of the possible new pure states, which is factorial and maximally specified. We do not restrict pure states to be deterministic. Probabilistic states, too, may not be further decomposable in the situation under investigation.



2.2.3 Factor dimension 

We now look for conditions under which learning can occur, i.e. when $p(y|q) \ne p(y|q, D_i)$, so that not all mutual informations between available and relevant data are zero. We are mainly interested in the effects of generalization where $q$ itself is not part of $D$. Indeed, for continuous $q^I$ the $q^I$ which are directly in the training set usually have measure zero, so learning has to proceed through generalization.

Data not changing $p(f^0)$ have equal probabilities under all pure states $f^0$ or, equivalently, equal diagonal elements (and therefore eigenvalues) of the matrix $T(f'^0, f^0)$, projected to the subspace $F^0_x$, for all $f^0_x$. A unique maximal $f^{0*}_x$, i.e. one with maximum probability for $D_i$ and not excluded from $f$, would be sufficient to ensure learning for all $f_x$ which are not yet in the state $f^{0*}_x$. A change in $p(f^0)$ may however correspond to non-relevant learning, changing not the relevant distributions $p(y|q, f)$ but only higher order interactions like $p(y, y'|q, q', f)$.

If the potential data $D$ are sufficient and the model is minimal for the relevant questions, then according to the negated version of the theorem for non-factorial priors there are training data $D_i$ such that relevant data $(q, y) \in D^I$ change. Consider a locally minimal model with $x = q^I \in X$, local data $D_x = (x^D, y^D)$, $x^D \in X$, and a given prior $p(f^0|f)$. Local data $q^D$ change the probability of $y$ under $x$ if $p(f^0_x) \ne p(f^0_x|D_x)$ and therefore $p(f^0_x) \ne p(f^0_x|f^0_D)$. Then within $p(f^0)$ states restricted to relevant and available data cannot factorize. Thus, the state of knowledge must fulfill the generalization condition for $x$ under $x^D$

$$p(f^0|f) \ne p(f^0_x|f^0_{X \setminus (x, x^D)}, f)\, p(f^0_D|f^0_{X \setminus (x, x^D)}, f)\, p(f^0_{X \setminus (x, x^D)}|f).$$

We can characterize a given state of knowledge by its factor dimension $\dim_F(f)$ with respect to a set $X$, defined as the maximal number $n$ of $x_i \in \{x_1, \ldots, x_n\} = X_n \subset X$ so that still one $f^0_x$, with $x \in X_n$ the same for all $f^0$, can be factorized, conditioned on $f^0_{X \setminus X_n}$. Then learning has to occur under $f$ if the number of local data $n$ is at least the factor dimension of $f$, i.e. $n \ge \dim_F(f)$, as then no additional $x$ can factorize.

We give three examples: 

Consider a space $F^0$ parameterized by local means $y_x$ for all $x$, defining a regression function $y$ according to $y(x) = y_x$:

1. A nonlocal prior [21] $p(f^0 = y) \propto e^{-c \sum_x (y_x - c_x)^2}$. Even though this state explicitly depends on every single $x$, it is factorial in $X$.

2. A symmetry (e.g. smoothness) prior $p(f^0 = y) \propto e^{-c \sum_x (y_x - y_{sx})^2}$, with $s$ indicating some bijective symmetry transformation (e.g. translation) $x' = sx$. It has a degenerate maximum $y_x = y_{sx}$ which can be made unique by adding local data for every orbit of $s$. The orbit of $x$ under $s$ consists of all elements $s^i x$ which can be generated out of $x$ by applying $s$ any number of times, $0 \le i < \infty$. The factor


21 We will discuss the measurement, i.e. the corresponding 
nonlocal questions in Section 3.2. 
dimension for each orbit is equal to one. In case of several orbits the total factor dimension is only $n - m + 1$, with $n$ the total number of $x$, and $m$ the size of the smallest orbit.

3. Data with p(f° = y) oc e 2-^i Vx ^ (parity for 
zero/one variables), which require additional data about $n - 1$ different $x_i$ to give a unique maximum. The sum can be seen as a local question in a space $F^0$ with a linearly transformed basis of $f^0$. The factor dimension is $n - 1$.
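The difference between the first two priors can be made concrete on a toy grid. The sketch below is an illustration under stated assumptions: only two points $x_1, x_2$, values $y_x \in \{-1, 0, 1\}$, and a symmetry $s$ that swaps the two points; none of these choices come from the text.

```python
import math
from itertools import product

VALUES = (-1, 0, 1)


def normalize(weights):
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def local_prior(c, centers):
    """p(y1, y2) proportional to exp(-c * sum_x (y_x - c_x)^2)."""
    return normalize({
        ys: math.exp(-c * sum((y - cx) ** 2 for y, cx in zip(ys, centers)))
        for ys in product(VALUES, repeat=2)
    })


def symmetry_prior(c):
    """p(y1, y2) proportional to exp(-c * sum_x (y_x - y_sx)^2), s swapping x1 and x2."""
    return normalize({
        (y1, y2): math.exp(-c * ((y1 - y2) ** 2 + (y2 - y1) ** 2))
        for y1, y2 in product(VALUES, repeat=2)
    })


def marginal(joint, index):
    out = {v: 0.0 for v in VALUES}
    for ys, p in joint.items():
        out[ys[index]] += p
    return out


def is_factorial(joint):
    """True if the joint equals the product of its two marginals."""
    m1, m2 = marginal(joint, 0), marginal(joint, 1)
    return all(math.isclose(p, m1[y1] * m2[y2], abs_tol=1e-12)
               for (y1, y2), p in joint.items())


def conditional_y2(joint, y1_observed):
    """p(y2 | y1 = y1_observed)."""
    return normalize({y2: joint[(y1_observed, y2)] for y2 in VALUES})
```

Under the first prior, observing $y_1$ leaves the distribution of $y_2$ unchanged; under the symmetry prior the same observation shifts probability toward $y_2 = y_1$, which is the generalization the factorial prior cannot provide.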

As a side remark, the factor dimension has similarities to the concept of VC dimension. In both cases generalization is impossible if the number of data is smaller than the corresponding dimension. The latter is usually applied to a family of loss functions indexed by $f \in F$ in the context of empirical risk minimization. A small $\dim_{VC}$ indicates that the loss function cannot vary too much, so the difference between minimal empirical and minimal expected risk can be bounded. These bounds are independent of the family $F^0$, as long as certain minimal conditions, e.g. bounded $p(y|q, f^0)$, are fulfilled. In contrast, the factor dimension, as used here, is independent of the loss function and therefore of $F$. The factor dimension has a similar interpretation for $F^0$ instead of $F$: nonconstant, independent data require a family $F^0$ at least 'large enough' to contain all the different combinations, while a smaller $F^0$ which excludes certain combinations implies the possibility of generalization. We already mentioned that not only natural choices of initial states but also the asymptotic final, i.e. pure, states of a learning process are factorial states, having maximal factor dimension. Hence, learning can decrease as well as increase the factor dimension. Usually one would expect a U-shaped dependence of the factor dimension on the learning process, analogous to the mutual information, which provides a quantitative measure of dependency between relevant and available questions. Commonly, prior information reduces the factor dimension and local data, at least deterministic ones, increase it.

Let $\dim^{\alpha}_{VC}(F^0)$ denote the ($\alpha$-dependent) VC dimension of a family of functions $p(y|x, f^0)$, $f^0 \in F^0$, which is the maximal number of different pairs $(x, y)$ so that for all of them there exist $f^0, f'^0 \in F^0$ with $p(y|x, f^0) > \alpha_{x,y}$ and $p(y|x, f'^0) < \alpha_{x,y}$, $0 < \alpha_{x,y} < 1$, i.e. the maximum number of points $(x, y)$ which can be shattered by $F^0$. Assume now that there exists an $\alpha_{x,y}$ with $p(y|x, D_{x,>}) > \alpha_{x,y} > p(y|x, D_{x,<})$ for at least one pair of local data $D_{x,>} = (x, y_>)$ and $D_{x,<} = (x, y_<)$. Thus, $\alpha_{x,y}$ separates the posterior probabilities $p(y|x, D_i)$ of at least two possible data $D_i$, and the actual state cannot be a pure state. Then, with $f^0 \in F^0$ the components of $f$ with $p(f^0|f) \ne 0$,

$$\dim_F(f) \le \dim^{\alpha}_{VC}(F^0).$$

Indeed, if we assume $\dim^{\alpha}_{VC}(F^0) = n$ then, according to the definition of the VC dimension, within a set of $n + 1$ questions $x$ there is at least one for which either $p(y|x, f^0) > \alpha_{x,y}$ or $p(y|x, f^0) < \alpha_{x,y}$ is impossible for all $f^0 \in F^0$. As a convex combination, the probability $p(y|x, f)$ for state $f$ can only lie between extremal points, i.e. pure states $f^0$. Therefore, the assumption of existence of both $D_{x,<}$ and $D_{x,>}$, used to construct $\alpha_{x,y}$, implies that there exist $f^0$ with $p(y|x, f^0) > \alpha_{x,y}$ as well as $f'^0$ with $p(y|x, f'^0) < \alpha_{x,y}$ with nonzero probability $p(f^0|f) \ne 0 \ne p(f'^0|f)$. Thus, the probability $p(f_x, f_{x_1}, \ldots, f_{x_n}|f)$ cannot factorize, which gives $\dim_F(f) \le \dim^{\alpha}_{VC}(F^0)$.

If the number of data is smaller than the factor dimension of $f$, learning can still be relevant if $Q^I$ contains multiple measurements (with respect to what has been considered an element in calculating the factor dimension). Multiple measurements of single components $q^I$ means considering their interactions also to be relevant. Using a measurement vector with one answer for every $x \in X$, i.e. $q = X$, makes every learning for local data relevant for locally minimal models. For example, knowing $y_1 - y_2 = 0$ does not change $p(y_1)$ or $p(y_2)$ if they have equal prior. However, it drastically changes the probability for a joint measurement of $y_1$ and $y_2$. This is an example where learning only occurs for non-relevant higher order dependencies. If higher order dependencies are missing, so that for a state with factor dimension $n$ already $m$ relevant questions factorize, then only $n - m$ data are necessary to allow generalization to a single component $q$.
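The example of knowing $y_1 - y_2 = 0$ can be reproduced in a few lines. This is a minimal sketch assuming a uniform, independent prior over three hypothetical values for each of $y_1$ and $y_2$; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

VALUES = (0, 1, 2)

# Uniform independent prior over (y1, y2).
prior = {ys: Fraction(1, 9) for ys in product(VALUES, repeat=2)}


def condition(joint, event):
    """Posterior after learning that event(y1, y2) is true."""
    kept = {ys: (p if event(*ys) else Fraction(0)) for ys, p in joint.items()}
    total = sum(kept.values())
    return {ys: p / total for ys, p in kept.items()}


def marginal(joint, index):
    out = {v: Fraction(0) for v in VALUES}
    for ys, p in joint.items():
        out[ys[index]] += p
    return out


posterior = condition(prior, lambda y1, y2: y1 - y2 == 0)
```

Both marginals stay uniform, so nothing is learned about a single component, while the joint distribution collapses onto the diagonal $y_1 = y_2$.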

An example of taking into account dependencies between components $q^I$ by multiple measurements is the special case of the risk of an algorithm which uses past test data $(y_i, q^I_i)$, $i < n$, to improve the selection of the next action for $q^I_n$. Here the next action, and therefore also the loss function, depends on all previous test data $(y_i, q^I_i)$. Then the risk does not consist of a sum of terms depending on disjoint sets of $q^I_i$, and correlations between components $q^I_i$ up to order $n > i$ can be important. Thus, the effective $q^I$ is the vector of components $q^I_i$, and the expected risk (on-line risk of an algorithm) is an average over this vector $q^I$. Correlations between answers to different $q^I$ (vectors, not components $q^I_i$) are not measured by this risk. However, the dependencies between vectors $q^I$ should be smaller than between components $q^I_i$ if $p(f^0|(y, q^I))$ is nearer to a pure state than $p(f^0|(y_i, q^I_i))$.

Loosely speaking, we only need to know what we can 
see, we only can know what we have seen, and it is struc- 
tural information which allows us to infer indirectly. 

2.2.4 Generalization-related sets of questions

We now give the definition of some sets of questions related to the previous analysis of generalization. We choose $Q = Q^I \cup Q^D$, $X = X^I \cup X^D$ and define the following sets of questions: Training questions $q$ in $Q^D$ but not in $Q^I$ are called non-test questions $q^{\neg I} \in Q^D \setminus Q^I = Q^{\neg I}$; test questions $q$ in $Q^I$ but not in $Q^D$ are called non-training questions $q^{\neg D} \in Q^I \setminus Q^D = Q^{\neg D}$. Analogously, we write $x^{\neg D} \in X^I \setminus X^D = X^{\neg D}$ for the non-training basis with $x \in X^I$ but $x \notin X^D$, as well as $x^{\neg I} \in X^D \setminus X^I = X^{\neg I}$ for the non-test basis with $x \in X^D$ but $x \notin X^I$. $Q^{I,D} = Q^I \cap Q^D$ denotes the set of common questions and $X^{I,D} = X^I \cap X^D$ the common basis. Clearly, we have $Q^{I,D} \cap Q^{\neg D} = \emptyset$, $Q^{I,D} \cap Q^{\neg I} = \emptyset$, $Q^I = Q^{I,D} \cup Q^{\neg D}$, $Q^D = Q^{I,D} \cup Q^{\neg I}$, and the corresponding relations for $X$. Notice that the common (or non-test, non-training) basis is not necessarily the basis of the common (or non-test, non-training) questions.
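These set relations can be spelled out on a toy example. The sketch below assumes four hypothetical questions, each mapped to the basis elements it depends on; the names `q_a`, `x1`, etc. are made up for illustration.

```python
# Hypothetical questions mapped to their basis (the x they depend on).
basis = {
    "q_a": {"x1"},
    "q_b": {"x2", "x3"},
    "q_c": {"x1", "x4"},
}

Q_I = {"q_a", "q_b"}   # relevant (test) questions
Q_D = {"q_b", "q_c"}   # training questions


def basis_of(questions):
    """Union of the bases of a set of questions."""
    return set().union(*(basis[q] for q in questions))


X_I, X_D = basis_of(Q_I), basis_of(Q_D)

Q_common = Q_I & Q_D          # common questions
Q_non_training = Q_I - Q_D    # relevant but not trained
Q_non_test = Q_D - Q_I        # trained but not relevant

X_common = X_I & X_D          # common basis
X_non_training = X_I - X_D    # non-training basis
X_non_test = X_D - X_I        # non-test basis
```

In this example the common basis is `{x1, x2, x3}` while the basis of the common questions is only `{x2, x3}`: the non-training question `q_a` depends on `x1`, which also appears in the training question `q_c`.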

Let us further introduce for sets $Q' \subset Q$, $X' \subset X$ the notation $Q'_{X'} \subset Q' \subset Q$ for the set of all questions within $Q'$ depending only on $X'$, i.e. with a basis completely within $X' \subset X$ so that $X_{Q'_{X'}} \subset X' \subset X$. Similarly, $Q'_{(X')} \subset Q'$ denotes the set of all questions depending (not necessarily only) on $X'$, i.e. with a basis having nonzero intersection with $X' \subset X$ so that $X_{Q'_{(X')}} \cap X' \ne \emptyset$. Obviously, $Q'_{X'} \subset Q'_{(X')} \subset Q'$, especially $Q = Q_{X_Q} = Q_{(X_Q)}$, and $Q'_{X''} \subset Q'_{X'}$, $Q'_{(X'')} \subset Q'_{(X')}$ for $X'' \subset X'$. We can partition a set $Q' \subset Q$ into two disjoint subsets with respect to $X' \subset X$ according to $Q' = Q'_{X \setminus X'} \cup Q'_{(X')}$. In particular, $Q^I = Q^I_{X^{\neg D}} \cup Q^I_{(X^D)}$ and $Q^D = Q^D_{X^{\neg I}} \cup Q^D_{(X^I)}$
with 

(°c — ^v(X D ) — VxD ^ (°c _ (°c x l — Q(X l ) — ^v •> 
and accordingly 

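$$X^I = X_{Q^I} \supset X_{Q^I_{(X^D)}} \supset X^{I,D} \supset X_{Q^I_{X^D}} \supset X_{Q^{I,D}},$$
$$X_{Q^{I,D}} \subset X_{Q^D_{X^I}} \subset X^{I,D} \subset X_{Q^D_{(X^I)}} \subset X_{Q^D} = X^D.$$

(See Fig. 1 and, for more details, Fig. 2.)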

We have shown in the previous subsection that in a factorial state the non-training basis $X^{\neg D}$ is not learnable and the non-test basis $X^{\neg I}$ does not influence relevant $q^I$ directly. However, $X^{\neg I}$ might well have indirect influence and act like noise sources within other questions $Q^D$ depending on $X^{I,D}$. Thus, information about states corresponding to $X^{\neg I}$ enables learning about the noise structure within $Q^D$. With reference to factorial states, we will therefore call the set of questions $Q_{X^{\neg D}} = Q^I_{X^{\neg D}} \subset Q^I$ depending only on $X^{\neg D}$ unlearnable questions, $Q^I_{(X^D)} = Q^I \setminus Q_{X^{\neg D}}$ (potentially) learnable questions, the set $Q_{X^{\neg I}} = Q^D_{X^{\neg I}} \subset Q^D$ indirect questions depending not on $X^I$, and $Q^D_{(X^I)} = Q^D \setminus Q_{X^{\neg I}}$ direct questions. Indirect questions alone cannot contribute to knowledge about $Q^I$; their influence is indirect, by contributing information about the unknown noise sources for questions in $Q^D_{(X^I)}$. Note that $Q^{\neg D} \supset Q_{X^{\neg D}}$ and $Q^{\neg I} \supset Q_{X^{\neg I}}$, which means that non-training questions can (and hopefully do) depend also on $X^D$ (so indirect information about them may be available) and non-test questions also on $X^I$ (so they can contribute directly). The non-training basis $X^{\neg D}$ could be eliminated by integrating over a corresponding (factorial) prior. Then all relevant questions depend only on $X^{I,D}$, i.e. we have $Q^I_{(X^{I,D})} = Q_{X^{I,D}}$. Therefore, we will call $Q_{X^{I,D}} = Q_{X^D}$ the set of effective questions.

Figure 1: The figure shows the set of questions $Q$ and
their corresponding basis $X$. The shaded area within
$X$ represents the common basis $X^{I,D}$, the shaded area
within $Q$ the learnable questions $Q^I_{(X^D)}$. Learning
related to questions within the shaded area but not within
the data $Q^D$ is called generalization. See Fig. 2 for more
details and notation.

Figure 2: This figure shows the relations in more detail
than Fig. 1. The basis of relevant questions $Q^I$ is $X^I$, and
for questions $Q^D$ corresponding to available data the
basis is $X^D$. While $Q^D$ is assumed to be always finite, $Q^I$,
$X^I$ and $X^D$ can be infinite. A model describing learning
can be restricted to the common basis $X^{I,D} = X^I \cap X^D$,
which is the double shaded area below. Only for questions
depending on that intersection can learning occur.
This set $Q_{XD}$ is symbolized by the upper horizontally
shaded area. Learning the set $Q_{XD} \setminus Q^{I,D}$ is called
generalization. Questions in $Q_{X^{\neg D}}$, being in $Q^I$ but not in
its shaded area, are unlearnable. Questions $Q^D_{(X^I)}$ can
contribute information about $Q^I$ directly. Questions in
$Q_{X^{\neg I}}$, being in $Q^D$ but not in its vertically shaded area
$Q^D_{(X^I)}$, are not directly related to relevant questions and
can only contribute indirectly, i.e. in combination with
$Q^D_{(X^I)}$.

In the following we give some simple examples (compare
Section 2.2) showing that learning is possible for $Q^I_{(X^D)}$,
but not necessarily for an individual $q^I \in Q^I_{(X^D)}$ if the
dependencies between different $q^I$ are not considered
relevant. Consider deterministic $x_1$ and $x_2$ with possible
values $y_i = \pm 1$ with independent prior probabilities equal
to 1/2. Then knowing only $y_1 y_2 = \pm 1$ without knowing
something about $y_2$ says nothing about $y_1$. Thus, the
prior factorizes with respect to the equivalence classes
$[f^0]_{x_1 \cup (x_1 x_2)}$ or $[f^0]_{x_2 \cup (x_1 x_2)}$, if we choose $x_1 x_2$ and $x_1$
(resp. $x_2$) as basis questions. If we choose all three as
basis then $[f^0]_{x_1 \cup x_2 \cup (x_1 x_2)}$ cannot be written in a factorial
form. Compare with the question $y_1 + y_2$, for which also
no information about $y_1$ results from $y_1 + y_2 = 0$;
however, $y_1 + y_2 = 2$ determines $y_1$ and $y_2$. As soon as either
$y_1$ or $y_2$ is known, both nonlocal questions give the value
of the missing $y_i$. If we do not choose $Q^I = \{x_1, x_2\}$ but
instead e.g. $Q^I = \{|y_1 - y_2|\}$ the prior can be written
in a factorial form with respect to $y(x) = |y_1 - y_2|$ and
$y'(x') = y_1 + y_2$: $p(f_x) p(f_{x'})$, as those are in the given
situation independent events.
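The first statements of this example can be checked by brute-force enumeration. The following is a minimal sketch, assuming a uniform prior over the four answer pairs $(y_1, y_2)$; the helper name `posterior_y1` is introduced here for illustration only and is not part of the formalism.

```python
from itertools import product
from fractions import Fraction

# All four equally likely answer pairs (y1, y2) with y_i = +1 or -1.
states = list(product([-1, 1], repeat=2))
prior = {s: Fraction(1, 4) for s in states}

def posterior_y1(condition):
    """Return P(y1 = +1 | condition) under the uniform prior."""
    kept = {s: p for s, p in prior.items() if condition(s)}
    total = sum(kept.values())
    return sum(p for s, p in kept.items() if s[0] == 1) / total

# Knowing only the product y1 * y2 leaves y1 undetermined.
p_given_product = posterior_y1(lambda s: s[0] * s[1] == 1)
# Knowing y1 + y2 = 0 also leaves y1 undetermined ...
p_given_sum_zero = posterior_y1(lambda s: s[0] + s[1] == 0)
# ... while y1 + y2 = 2 fixes y1 = +1.
p_given_sum_two = posterior_y1(lambda s: s[0] + s[1] == 2)
# The product together with a known y2 determines y1.
p_given_product_and_y2 = posterior_y1(
    lambda s: s[0] * s[1] == 1 and s[1] == -1)
```

The nonlocal answers become informative about $y_1$ only in combination with a second answer, or when they happen to single out one configuration.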

2.3 Why structural information? 

Structural information is our knowledge about the defini- 
tion of questions corresponding to predetermined depen- 
dencies between answers, or in other words, our knowl- 
edge of what we are measuring. One might wonder 
whether such predetermined dependencies between an- 
swers are necessary and corresponding data can be avail- 
able and useful. 

First, we remark, as discussed in the previous section,
that generalization is not possible without nonlocal
information. Any nonlocal information has two parts:
a structural one, represented by the definition of the
question, and the answer, to be determined by empirical
measurement or control. In other words, in any case there
must be predefined dependencies between answers to some
questions in order to be able to draw inferences about
unseen questions. So, as we will discuss in more detail
below, when requiring a bound on a certain smoothness or
symmetry property one must for example know: 1) the
definition of the smoothness or symmetry in mind,
corresponding to the structural part of the information,
i.e. the predefined dependency of the smoothness question
on the other questions, 2) a bound on the allowed function
values, obtained by empirical measurement, active
enforcement or pure assumption.

Secondly, we refer to the observation that it is very 
common to assume dependencies between answers. In 
logic these dependencies are called rules. With zero- 
one function values the logical operations can be writ- 
ten using multiplication, addition and the step function. 
Fuzzy set theory deals with unsharp rules and Bayesian 
belief networks are the general probabilistic formulation. 
Compared with logic based artificial intelligence expert 
systems which construct many rules with often low non- 
locality (humans can better deal with lower order corre- 
lations) usual statistical models use only one, but highly 
nonlocal rule, e.g. some bound on a smoothness functional.
However, this difference is more of a practical than
of a fundamental theoretical nature, insofar as any number
of rules, each with maybe low nonlocality, can be
combined into one ('vector') rule which can have high
nonlocality. The definition of a macroscopic observable in
its dependence on microscopic variables is an example of
structural information. One may say the usual macroscopic
variables in the form of integrals over all $x$ have
maximal nonlocality, depending on all basis questions. The
definition of an 'energy question' could have the following
form in a real, scalar, Euclidean field theory

$$p(y|q_E, f) = \delta\left( \int dx \left[ (\nabla f(x))^2 + m(x) f^2(x) \right] + \int dx \left( h(x) f(x) + E_0(x) \right) - y \right),$$

where the first term has the form of a 'smoothness
prior'. In terms of this analogy to physics, a statistical
approximation problem with mean square error
$\sum_i a_{x_i} (f(x_i) - y_i^D)^2$, data $y_i^D$ and a smoothness prior
might be seen as minimizing a 'free' massless kinetic
term, an $x_i$-dependent 'mass' $m(x) = a_x \sum_i \delta(x - x_i)$
and 'external field' $h(x) = -2 a_x \sum_i \delta(x - x_i) y_i^D$
coupling to $f(x)$ at the data points $x_i$, and a constant
$E_0 = a_x \sum_i \delta(x - x_i) (y_i^D)^2$. Higher order (local or
nonlocal) interactions correspond to other forms of nonlocal
questions. We will discuss in more detail below how
answers to questions with infinite nonlocality can be
enforced by measuring devices or by controlling the
situations under investigation.
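The identification of 'mass', 'external field' and constant follows from expanding the square in the mean square error. A minimal numerical sketch of that expansion, assuming the mass term multiplies $f(x)^2$ and using randomly generated placeholder values for the weights, data and function values:

```python
import random

random.seed(0)
n = 5
a = [random.uniform(0.5, 2.0) for _ in range(n)]   # weights a_{x_i}
y = [random.gauss(0.0, 1.0) for _ in range(n)]     # data y_i^D
f = [random.gauss(0.0, 1.0) for _ in range(n)]     # function values f(x_i)

# Mean square error term, summed over the data points.
squared_error = sum(ai * (fi - yi) ** 2 for ai, fi, yi in zip(a, f, y))

# The three pieces obtained by expanding the square:
mass_term = sum(ai * fi ** 2 for ai, fi in zip(a, f))                 # m(x) f(x)^2
field_term = sum(-2.0 * ai * yi * fi for ai, yi, fi in zip(a, y, f))  # h(x) f(x)
constant = sum(ai * yi ** 2 for ai, yi in zip(a, y))                  # E_0
```

Here the delta functions at the data points have already been integrated out, so each integral reduces to a sum over $i$.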

Finally, we point out that one always has to know
what one measures, at least in a probabilistic sense, and
therefore structural information is necessary in any case.
When repeating measurements of basis questions $x$ in
a fixed state $f^0$ we assume that we use the same device
with a stationary answer distribution. This means,
for example, that if we measure in the same state a hundred
times $y = 5$ followed by a hundred times $y = 42$, we
might assume that our question $x$ is incompletely specified
and that another (hidden) variable should have been
included, which probably changed its value after a hundred
measurements. Such not directly observable or hidden
variables define the state $f^0$, and we can say that
changing the value of this additional variable represents
a change in the unknown state $f^0$. However, we have
to define the (not directly observable) state $f^0$ to be
constant to allow us to infer something about future
repeated measurements.$^{22}$ All controllable aspects of the
model are attributed to $x$, so controllable actions like
replacing, moving or transforming an object to be measured
are part of $x$. On the other hand, variables which are
themselves stationarily distributed just increase the noise
and do not necessarily have to be included in $x$. Thus,
stationarity of $p(y|x, f^0)$ is a form of structural
information.

$^{22}$ If there is a known dependency on time (external time or
internal time corresponding to the measurement history, like
the number of repetitions of a certain measurement) then
the time variable is part of $x$. A measurement at a certain
time can be repeated if the time variable can be reset (e.g. to
zero). If time cannot be reset, repeating the same measurement
is not possible. One can think of other restrictions, so
that a measurement of the same question cannot be repeated.
For example, if continuous questions $x$ are generated by a
random process, the event of repeating a measurement
usually has measure zero. Those cases show the importance
of information about relations between answers to different
questions and not only between repeated identical questions,
i.e. of nonlocal dependencies.

In the next sections we will discuss possibilities of the 
inclusion of a larger number and variety of generalized 
data in statistical decision making processes (including 
classification and approximation) which are based on 
structural information. Those data correspond to prior 
information in the statistical language, rules in the logi- 
cal language, interactions in the physical language. From 
logical (fuzzy set, Bayesian belief network) expert sys- 
tems one can learn how to deal with a lot of different, 
heterogeneous rules and from physics the treatment of 
highly nonlocal (macroscopic) variables different from 
smoothness. 

3 Generalized questions 
3.1 How to write them 

A generalized question is defined by giving a set of
required basis questions and defining a function $y_q = q(y(x))$
to be applied to their results. To simplify notation
we skip the vector arrow and use the same letter $q$
for the defining function as well as for the functional
itself, that is $q(y) = q_q(y)$. Generalized questions are fully
defined by their answer distributions in the pure states
$f^0 \in F^0$. As we always assume the model (of nature) to
be in a possibly unknown but pure state, the $p(y_q|q, f^0)$
represent the real processes, and we write all formulas
within this section for the $f^0$. This formulation without
reference to some state of knowledge $f$ also fits into
the Frequentist interpretation of statistics, meaning that
the definitions of generalized questions do not require
specifying an explicit model of $f^0$ and are therefore not only
meaningful for Bayesians. This holds as long as we do not
make the definition of a question explicitly $f$-dependent.
Thus, in the following we form products for pure states
$f^0$ and not for states of knowledge $f$,

$$p(y, y'|q, q', f) = \int df^0\, p(f^0|f)\, p(y, y'|q, q', f^0) = \int df^0\, p(f^0|f)\, p(y|q, f^0)\, p(y'|q', f^0),$$

including the case $q = x$ and $q' = x'$. The fact that
$p(y, y'|q, q', f)$ does not necessarily factorize enables
learning. But notice that we always assume the different
parts of a question to be measured in the same
pure state $f^0$. For $f^0$ we postulated independence, i.e.
factorization

$$p(y, y'|q, q', f^0) = p(y|q, f^0)\, p(y'|q', f^0),$$

meaning that different questions $q$, $q'$ use different
realizations of answers to $x$ and that learning while being
in a pure state, i.e. $f = f^0$, is not possible. In this
formulation results depending on the same realization are seen
as components of one question. On the other hand, the
following is easily adapted to a notation which allows
such dependency between non-basis questions.
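That a mixture of factorizing pure states need not factorize itself can be seen in a small discrete case. The sketch below assumes two hypothetical pure states with binary answers and a state of knowledge that mixes them with equal weight; all names and numbers are illustrative choices.

```python
from fractions import Fraction

# Two hypothetical pure states for binary answers to two basis questions.
# Each entry is P(answer = 1 | question, pure state).
pure_states = {
    "f0_a": {"x": Fraction(9, 10), "x_prime": Fraction(9, 10)},
    "f0_b": {"x": Fraction(1, 10), "x_prime": Fraction(1, 10)},
}
# State of knowledge f: a mixture p(f0 | f) over the pure states.
mixture = {"f0_a": Fraction(1, 2), "f0_b": Fraction(1, 2)}

def p_answer(y, question, f0):
    p1 = pure_states[f0][question]
    return p1 if y == 1 else 1 - p1

def joint(y, y_prime):
    """p(y, y' | x, x', f): factorized within each pure state, then mixed."""
    return sum(w * p_answer(y, "x", f0) * p_answer(y_prime, "x_prime", f0)
               for f0, w in mixture.items())

def marginal(y, question):
    return sum(w * p_answer(y, question, f0) for f0, w in mixture.items())

joint_11 = joint(1, 1)
product_11 = marginal(1, "x") * marginal(1, "x_prime")
# Predictive probability for y' = 1 after having observed y = 1.
predictive = joint_11 / marginal(1, "x")
```

In this toy setting the joint probability differs from the product of the marginals, so observing $y$ changes the prediction for $y'$, which is exactly what is meant by learning.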

A simple example is a question asking for the sum of
two independent $x_i$-measurements with answer probability

$$p(y|q, f^0) = p(\{y_1 + y_2 = y\}|x_1, x_2, f^0) = \int dy_1\, dy_2\, \delta(y_1 + y_2 - y)\, p(y_1|x_1, f^0)\, p(y_2|x_2, f^0),$$

where $\delta(y_1 + y_2 - y)$ is the indicator function of the event
$y_1 + y_2 = y$ and, as always in this paper, the $\delta$-functional
has to be understood as the Kronecker-$\delta$ function $\delta_{y_1 + y_2,\, y}$ in
the discrete case.
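In the discrete case the answer distribution of the sum question is a plain convolution. A minimal sketch, assuming answer distributions given as dictionaries from answer values to probabilities (the function name `sum_question` is an illustrative choice):

```python
from collections import defaultdict
from fractions import Fraction

def sum_question(p1, p2):
    """Answer distribution of q = y1 + y2 for independent discrete answers.

    p1, p2 map answer values to p(y1 | x1, f0) and p(y2 | x2, f0).
    The Kronecker delta is realized by accumulating into the key y1 + y2.
    """
    out = defaultdict(Fraction)
    for y1, w1 in p1.items():
        for y2, w2 in p2.items():
            out[y1 + y2] += w1 * w2
    return dict(out)

half = Fraction(1, 2)
p_x1 = {-1: half, 1: half}
p_x2 = {-1: half, 1: half}
p_q = sum_question(p_x1, p_x2)
```

For the $\pm 1$ answers used earlier, the sum takes the values $-2$, $0$, $2$ with probabilities $1/4$, $1/2$, $1/4$.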



Deterministic questions

The last example can easily be generalized by replacing
$y_1 + y_2$ by any other defining function $q(y)$ depending
on a data vector $y$ with components $y_i$,$^{23}$

$$p(y_q|q, f^0) = p(\{q(y) = y_q\}|x_q, f^0).$$

To simplify notation, we now use the symbol $x_q$ not for
a single basis question but for a whole vector of them.
Its components $x_{q,i}$ correspond to the $y_i$ necessary to
evaluate $q$ and can include repeated measurements of
the same basis question. Specifically, averages of actual
measurements correspond to choosing $q(y) = \frac{1}{n} \sum_i^n y_i$.
The answer $y_q$ to question $q$ can be a vector with
components $y_{q,j}$, meaning that

$$\delta(q(y) - y_q) = \prod_j \delta(q_j(y) - y_{q,j}).$$

For noisy $x$ this does not give the product

$$\prod_j p(y_{q,j}|q_j, f^0),$$

because this would refer to a situation where the $y_{q,j}$ do
not use the same realization but sample their own $y_i$,
resulting in additional $y$-integrals.

While in the following no special attention is paid to
the technical difficulties of the case of continuous $y$, we
briefly discuss the definition of the $\delta$ for this case, where
$\delta$ stands not for the Kronecker function but for the
$\delta$-functional. This is defined for general $q(y)$ with
$dq/dy \neq 0$ according to

$$\int dy\, p(y|x, f^0)\, \delta(q(y) - y_q) = \sum_{y_0} p(y_0|x, f^0) \left| \frac{dq(y)}{dy} \right|^{-1}_{y = y_0},$$

where $y_0$ are the solutions of $q(y) = y_q$, i.e. the zeros
of the argument $q(y) - y_q$. The nonzero first derivative
guarantees that $q(y)$ is locally invertible, so we can write
$y_0 = q^{-1}(y_q)$ with $q^{-1}$ defined at least in a neighborhood
of $y_q$. In case $q(y) = y_q$ is fulfilled on a whole interval,
the $\delta$-functional has to be replaced by the characteristic
function of that event, which can be expressed by step
functions $\Theta$.
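The rule for the $\delta$-functional of a function can be checked numerically. The sketch below assumes a standard normal answer density and the illustrative defining function $q(y) = y^2$; it approximates the $\delta$ by a narrow Gaussian of width `eps` and compares the resulting integral with the sum over the zeros $y_0$.

```python
import math

def normal_pdf(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def q(y):
    return y * y          # illustrative defining function

def dq(y):
    return 2.0 * y        # its first derivative

y_q = 1.5

# Right-hand side: sum over the zeros y0 of q(y) - y_q.
roots = [math.sqrt(y_q), -math.sqrt(y_q)]
rhs = sum(normal_pdf(y0) / abs(dq(y0)) for y0 in roots)

# Left-hand side: replace the delta by a narrow Gaussian of width eps
# and integrate on a fine grid (Riemann sum over [-6, 6]).
eps, step = 1e-2, 1e-4
lhs = 0.0
n_steps = int(12.0 / step)
for k in range(n_steps + 1):
    y = -6.0 + k * step
    u = (q(y) - y_q) / eps
    kernel = math.exp(-0.5 * u * u) / (eps * math.sqrt(2.0 * math.pi))
    lhs += normal_pdf(y) * kernel * step
```

Both sides agree up to the smoothing error of the Gaussian approximation, which shrinks as `eps` and `step` are reduced together.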



$^{23}$ The components $y_i$ of the data vector as well as the
answers $y_q$ can be vectors themselves.



Output noise

A question can also be constructed including additional
internal random variables $z$ by using a probabilistic
defining function $q(y, z)$ defined by $f^0$-independent
output distributions $p(z|y, q) = p(z|y, q, f^0)$. The answer
can be seen as created in a two step process

$$p(y_q|q, f^0) = \int dy\, p(y_q|y, q)\, p(y|x_q, f^0),$$

with

$$p(y_q|y, q) = \int dz\, p(z|y, q)\, \delta(q(y, z) - y_q)$$

and

$$p(y|x_q, f^0) = \prod_i p(y_i|x_i, f^0).$$

The vector of internal random variables $z$ increases the
noise in answers, while additional (repeated) measurements
of the same basis questions reduce the noise. Note
that the noise variables of different questions are
independent, that is we assume $p(z_q, z_{q'}) = p(z_q)\, p(z_{q'})$ for
$q \neq q'$, and questions always measure their own $y$-values
according to $p(y|q, f^0)$ and do not refer to measurements
of other questions. One specific realization of $y$ can be
used multiple times only within the same question.
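The two step construction can be made explicit in a fully discrete toy case. The sketch below assumes two binary basis answers, a logical AND as defining function, and a single binary noise variable $z$ that flips the result with a hypothetical probability `FLIP`; none of these choices come from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical binary basis answers: P(y_i = 1 | x_i, f0) for two questions.
p_one = [Fraction(3, 4), Fraction(4, 5)]
FLIP = Fraction(1, 10)   # output noise: z = 1 flips the result

def p_y(y):
    """p(y | x_q, f0) as a product over the components y_i."""
    out = Fraction(1)
    for yi, p1 in zip(y, p_one):
        out *= p1 if yi == 1 else 1 - p1
    return out

def q(y, z):
    """Probabilistic defining function: AND of the y_i, flipped if z = 1."""
    return (y[0] & y[1]) ^ z

def p_yq_given_y(y_q, y):
    """p(y_q | y, q): sum over z of p(z | y, q) * delta(q(y, z) - y_q)."""
    return sum(pz for z, pz in ((0, 1 - FLIP), (1, FLIP)) if q(y, z) == y_q)

def p_yq(y_q):
    """p(y_q | q, f0): sum over y of p(y_q | y, q) * p(y | x_q, f0)."""
    return sum(p_yq_given_y(y_q, y) * p_y(y)
               for y in product([0, 1], repeat=2))

noise_free = p_one[0] * p_one[1]   # P(AND = 1) without output noise
noisy = p_yq(1)                    # the same answer with output noise
```

The output noise pulls the answer probability toward $1/2$, i.e. it makes the answer less informative about the underlying $y$.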

Input noise

The vector $x$ of basis questions contained in $q$ and its
dimension can be an $f^0$-independent random variable.
We call this a situation with input noise. Then $q(x, y)$
depends on $x$ generated according to $p(x|q)$. The index
$q$ indicates that this distribution is part of the definition
of the functional $q$. Thus, we have

$$p(y|q, f^0) = \int dx\, p(x|q) \prod_i^{\dim(x)} p(y_i|x_i, f^0),$$

with $\int dx = \sum_j \int \prod_i^j dx_i$ and $p(x) = p(\dim = j)\, p(x|j)$.
We understand all dependencies within the vector $x$ to
be included in this notation. Products of $p(y_i|x_i, f^0)$
correspond to a logical AND applied to the results $y_i$,
summations over $x_i$ and $z$ correspond to a logical OR. Input
noise can be combined with a probabilistic $q$-function
$q(x, y, z)$

$$p(y_q|q, f^0) = \int dy\, p(y|q, f^0)\, p(y_q|y, q) = \int dx\, p(x|q) \int \left( \prod_i^{\dim(x)} dy_i\, p(y_i|x_i, f^0) \right) p(y_q|x, y), \qquad (1)$$

where $p(y_q|x, y) = \int dz\, p(z|x, y, q)\, \delta(q(x, y, z) - y_q)$ and
$\int dy = \sum_j \int \prod_i^j dy_i$. Finally, we remark that input noise
has both more active and more passive interpretations.
Changing the variable $x$ can be interpreted as changing
the device or changing the situation. For example,
measuring another location $x$ of an object can be done
by moving the measuring device to location $x$ (passive
interpretation) or by moving the object (active
interpretation) or by moving both (mixed interpretation). The
passive interpretation may be the usual one, but as we
assume $f^0$ to be constant over the whole time of interest,
all active controllable aspects must be part of $x$.



Causal chains

We can also allow output dependent input noise by
choosing the $x_i$ depending also on results of previous
measurements $y_j$ included in $q$. For example, active
learning algorithms which select the next question
according to the data in the past belong to this class of
questions. We order the factors according to the causal
dependencies of the generating process like

$$p(x, y|q, f^0) = p(x_1|q)\, p(y_1|x_1, f^0) \times p(x_2|x_1, y_1, q)\, p(y_2|x_2, f^0) \times p(x_3|x_2, y_2, x_1, y_1, q)\, p(y_3|x_3, f^0) \cdots = p(x|y_c, q) \prod_i p(y_i|x_i, f^0),$$

where the subscript $c$ indicates a 'causal' ordering, meaning
$p(x|y_c, q) = p(x_1|q) \prod_{i=2} p(x_i|\{x_j, y_j\}_{1 \le j \le i-1}, q)$,
with (the 'last') one of the components of vector $y$
missing. Note that we understand the causal structure for
components of $x$ to be implicit in the notation $p(x|y_c, q)$.
Chains with variables depending only on those of the
previous step, $p(x|y_c, q) = p(x_1|q) \prod_{i=2} p(x_i|x_{i-1}, y_{i-1}, q)$,
are sometimes called Markov chains.$^{24}$ For finite
sequences such a representation can always be achieved by
combining $x$, $y$ of different steps into one vector variable,
or in general by including the relevant memory variables.
In the extreme case this would lead back to the starting
point $p(x|q)\, p(y|x, q)$. Including internal noise variables $z$,
their probability $p(z|x, y)$ may also be written in a causal
realization modeling the real causal processes

$$p(x, y, z|q, f^0) = p(x_1|q)\, p(y_1|x_1, f^0)\, p(z_1|x_1, y_1, q) \times p(x_2|x_1, y_1, z_1, q)\, p(y_2|x_2, f^0)\, p(z_2|x_1, y_1, z_1, x_2, y_2, q) \cdots = p(x|y_c, z_c, q)\, p(z|x, y, q) \prod_i p(y_i|x_i, f^0).$$

It is always possible to write a joint probability in this
form and we always understand the indices of the
variables $x$, $y$, $z$ to refer to the same ordering. This gives

$$p(y_q|q, f^0) = \int dx \int dy \int dz \prod_i p(y_i|x_i, f^0)\; p(x|y_c, z_c, q)\, p(z|x, y, q)\, \delta(q(x, y, z) - y_q)$$
$$= \int dx\, dy\, dz\; p(y|x, f^0)\, p(x|y_c, z_c, q)\, p(z|x, y, q)\, p(y_q|x, y, z, q), \qquad (2)$$

showing the separation of the $q$-dependent factors
from the $f^0$-dependent factors. Special
realizations of causal dependencies, including variables $z$,
defining the answer producing process can be modeled
by directed acyclic graphs, i.e. graphical models or belief
networks (Pearl, 1988; Lauritzen, 1996; Jensen, 1996;
Ripley, 1996). A pair of variables $R_q = (x, z)$ can be
called a realization of $q$ because it represents that part of
the definition of a question determining a specific answer
$y_q$, and $D_q = (x, y)$ the corresponding data. We denote
the set of all questions of the form of Eq.(2) including
only finite dimensional vectors $x$, $y$, $z$ by $Q_X^{fin}$.

$^{24}$ See for example Golden, 1986. However, the term Markov
chain is used in many variations. See for example van Kampen,
1992, p. 77 and its footnote on p. 89 referring to a footnote
on p. 340 in Feller, 1957.



Combinations, decompositions

Having defined a set $Q'$ of questions $q'$ we obviously
can generate new questions $q$ according to Eq.(2) by
replacing $p(y|x, f^0)$ with $p(y|q', f^0)$ and choosing any
input functions $p(q'|y_c, z_c, q)$, internal noise functions
$p(z|x, y, q)$, and defining functions $q(x, y, z)$. To avoid
circular definitions the $q$ must of course not contain
itself. If the new $q$ is expressed in terms of basis questions,
i.e. the $p(y|x, f^0)$, it still has the form of Eq.(2), meaning
it belongs to $Q_X^{fin}$. Consequently, for arbitrary $q' \in Q_X^{fin}$
a $q$ of the following general $q$-form

$$p(y_q|q, f^0) = \int dq'\, dy\, dz\; p(q'|y_c, z_c, q)\, p(y|q', f^0)\, p(z|x, y, q)\, \delta(q(x, y, z) - y_q) \qquad (3)$$

is also in $Q_X^{fin}$. In this sense $Q_X^{fin}$ is closed. Also, for
every $q \in Q_X^{fin}$ there are $q' \in Q_X^{fin}$ so that $q$ can be linearly
decomposed in that form. Specifically, the $q'$ can always
be chosen as $x$.

In general a decomposition of $q$ in arbitrary components
$q'$ can be written as

$$p(y|q, f^0) = \int dq'\, p(q'|q, f^0)\, p(y|q', q, f^0),$$

with $f^0$-dependent $p(q'|q, f^0)$ if $q'$ is sampled depending
on $y$. We can get a $y$- and therefore $f^0$-independent
decomposition of $q$ into lower noise components $q^0$ in
analogy to the $x$-independent decomposition of states $f$ into
pure states $f^0$. To get this, we separate $R_q = (x, z)$ into
a $y$-independent part $q^0$ and a $y$-dependent part. Note
that at least one $x_i$ is in $q^0$ and arbitrary dependencies
are allowed within $q^0$. Then we have an $f^0$-independent
decomposition of $q$

$$p(y|q, f^0) = \int dq^0\, p(q^0|q)\, p(y|q^0, q, f^0), \qquad (4)$$

in analogy to an $x$-independent decomposition$^{25}$ of $f$

$$p(y|x, f) = \int df^0\, p(f^0|f)\, p(y|x, f^0),$$

but with, in general, differing 'local states' $p(y_q|q^0, q, f^0)$
for different $q$. Despite that analogy we do not assume
the availability of data which can change the $p(q^0|q)$.
This is without loss of generality, as all data dependencies
of question definitions can be incorporated by enlarging
the space $F^0$ and adding questions with answers depending
on $p(q^0|q)$. In fact, in practice an example generating
distribution $p(x)$ is often unknown and estimated
using the sampling distribution. Formally we can combine
$x$ and $y$ into a new $y' = (x, y)$ and define functions
$f'^0 = (f^0_x, f^0_y)$ by $p(y'|x', f'^0) = p(y|x, x', f^0_y)\, p(x|x', f^0_x)$.
The new $x'$-variable can be skipped from the notation,
having only one value with the minimal meaning of
requesting another $y' = (x, y)$ value. A state of knowledge
$f'$ now also contains information about $p(x)$, and if
$p(f'^0) = p(f^0_x)\, p(f^0_y)$ we can estimate $p(x)$ independently
of $p(y|x)$ by observing pairs $(x, y)$. But note that for
continuous $X$, if not parameterized by a finite set of
parameters, the $f^0_x$-integration is a functional integration
and not necessarily well defined.

$^{25}$ For simplicity we use from now on the integral notation
also for the $f^0$-variable, assuming a well defined (possibly
discrete) measure.

If studying $f$-dependency we may also allow $f$-dependent
definitions of $q$, as the state of knowledge is
assumed to be known and therefore this knowledge could
be incorporated into the answer producing process. The
preceding equations include this case if $q$ is interpreted
as a double index $q \to (q, f) = q_f$. Notice that as answers
change the state of knowledge $f$, they also change
the definition of $q$.

General functionals

We allowed the answer probability $p(y_q|q, f^0)$ at $y_q$
of a generalized question to be any functional of the
$y$-distributions $p(y|x, f^0)$ (for all $x$ but fixed $f^0$). A
parameterization of the $p(y|x, f^0)$, or in the extreme case
the $p(y|x, f^0)$ for all $x$ and $y$ themselves, describes a state
$f^0$ completely. Functionals depending on the $p(y|x, f^0)$
include for example

$$p(y_q|q, f^0) = \delta\left( y_q - \int dy\, y\, p(y|x, f^0) \right),$$

giving the local expectation, or in a noisy version

$$p(y_q|q, f^0) \propto e^{-\left( y_q - \int dy\, y\, p(y|x, f^0) \right)^2 / (2\sigma^2)},$$

as well as

$$p(y_q|q, f^0) = \delta\left( y_q - p(y|x, f^0) \right),$$

giving a specific probability (density) or general

$$p(y_q|q, f^0) = \int dy\, p(y|f^0)\, \delta(y_q - q(z)).$$

We recognize that the questions we have constructed so
far in the previous sections contain products of $p(y|x, f^0)$
(together with a sum implicit in the notation of the last
sections) and are therefore functionals with a finite power
expansion

$$p(y_q|q, f^0) = \sum_n \left( \prod_i^n \int dx_i\, dy_i\, p(y_i|x_i, f^0) \right) a_n(x, y).$$

Those generalized questions can be measured using a
finite number of answers to basis questions, while questions
with an infinite but converging power expansion
might be measured approximately. Expectations and
probabilities of events can sometimes be approximated
using empirical sums, but in general not a probability
density $p(y|x, f^0)$ for continuous $y$. We will use the word
measurement for the process of getting an answer to a
question, but we will show in the next section that for
questions with no finite or converging power series this
has more the character of enforcing an answer or active
control of $F^0$.

3.2 How to measure them 

An answer to a question could be obtained using only
measurements of basis questions $x$ when it has a defining
function $q(x, y, z)$ depending only on a finite number of
outcomes $y_i(x_i)$.



3.2.1 Finite case 

We denote the set of questions with finite dimensional
$y$, $z$ by $Q_X^{fin}$. With $Q$ denoting the measurable questions
we have therefore $X \subseteq Q \Rightarrow Q_X^{fin} \subseteq Q$. These questions
could (but do not have to) be measured by a finite number
of possibly repeated measurements of basis questions
$x \in X$ according to the following steps:

i. Choose $x$ according to $p(x|q)$,

ii. Get $y$ for the basis questions in $x$,

iii. Get $z$ according to $p(z|x, y)$, (Repeat i., ii. and iii.
in case of $p(x|y_c, z_c)$ dependencies)

iv. Insert result into $q(x, y, z)$.
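Steps i. to iv. can be sketched as a small sampling routine. The following assumes the simple non-causal case (no repetition of steps i. to iii.), and all ingredient functions (`sample_x`, `ask_basis`, `sample_z`, `q`) as well as the linear state $f^0(x) = 3x$ are hypothetical placeholders chosen only for illustration.

```python
import random

def measure(sample_x, ask_basis, sample_z, q, rng):
    """Answer a finite generalized question via steps i. to iv."""
    x = sample_x(rng)                         # i.   choose x according to p(x|q)
    y = [ask_basis(xi, rng) for xi in x]      # ii.  get y for the basis questions in x
    z = sample_z(x, y, rng)                   # iii. get z according to p(z|x,y)
    return q(x, y, z)                         # iv.  insert result into q(x,y,z)

# Hypothetical ingredients: average a noisy function at two random locations.
def sample_x(rng):
    return [rng.uniform(0.0, 1.0) for _ in range(2)]

def ask_basis(xi, rng):
    return 3.0 * xi + rng.gauss(0.0, 0.1)     # p(y_i | x_i, f0) with f0(x) = 3x

def sample_z(x, y, rng):
    return rng.gauss(0.0, 0.05)               # internal output noise

def q(x, y, z):
    return sum(y) / len(y) + z

rng = random.Random(0)
answers = [measure(sample_x, ask_basis, sample_z, q, rng) for _ in range(5000)]
mean_answer = sum(answers) / len(answers)
```

Repeated calls of `measure` draw independent realizations, in line with the assumption that different questions do not share answers to basis questions.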

While the measurements can be performed using only de- 
vices measuring basis questions, these questions do not 
necessarily need the full information contained in the 
answers to x. That means measuring devices designed 
specifically for them could be more effective. For exam- 
ple, using interference of waves differences can be mea- 
sured in physics sometimes much more precisely than 
absolute values. 

We now discuss why in the finite case a direct measurement
of $q$ might be preferable, a situation which often occurs in
practice. Consider output noise which is added in a final
step for all measurements during training, like observation
and memory errors, or transformation to a lower
resolution final output scale. Then measuring devices
which are able to directly access the underlying lower
noise function and perform the necessary operations before
adding the output noise have higher accuracy. As a
simple example, take a basis set $X$ with output distributions

$$p(y|x, f^0) = N(\mu_x(f^0), \sigma)$$

and

$$\mu_x(f^0) = \int dy\, y\, p(y|x, f^0),$$

where $N(\mu, \sigma)$ stands for a Gaussian centered at $\mu$ with
variance $\sigma^2$. Then, if measurable, the generalized question

$$p(y|q, f^0) = N(\mu_q(f^0), \sigma), \qquad \mu_q = \mu_{x_1}(f^0) + \mu_{x_2}(f^0),$$

obtains the sum with greater accuracy than using the
sum $y_1 + y_2$ of two basis questions $x_1$ and $x_2$, where the
independence of the noise gives $2\sigma^2$ for the variance. In
this special case the question would be equivalent to four
basis questions, including one repeated measurement for
every $x_i$, to get $(y_1 + y_1' + y_2 + y_2')/2$. When replacing the sum
$\mu_{x_1} + \mu_{x_2}$ by an integral $\int \mu_x\, dx$ (infinite nonlocality) or
setting $\sigma$ in $q$ equal to zero while retaining a finite $\sigma$ for
the $x_i$ (infinite accuracy), the question depends on more
than a finite number of answers to basis questions.
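The variance statements of this example can be checked by simulation. A minimal sketch, assuming Gaussian basis answers with hypothetical means `MU1`, `MU2` and a common standard deviation `SIGMA`:

```python
import random
import statistics

random.seed(2)
SIGMA = 1.0
MU1, MU2 = 0.3, -1.2          # hypothetical means of the two basis questions

def basis(mu):
    """One noisy answer to a basis question with mean mu."""
    return random.gauss(mu, SIGMA)

TRIALS = 20000
# Sum of two single basis answers.
two_basis = [basis(MU1) + basis(MU2) for _ in range(TRIALS)]
# Four basis answers, one repetition per question, divided by two.
four_basis = [(basis(MU1) + basis(MU1) + basis(MU2) + basis(MU2)) / 2.0
              for _ in range(TRIALS)]
# Direct measurement of the generalized question with noise SIGMA.
direct = [random.gauss(MU1 + MU2, SIGMA) for _ in range(TRIALS)]

var_two = statistics.pvariance(two_basis)     # close to 2 * SIGMA**2
var_four = statistics.pvariance(four_basis)   # close to SIGMA**2
var_direct = statistics.pvariance(direct)     # close to SIGMA**2
```

Within sampling error, the direct question is as accurate as the four-measurement combination and twice as accurate, in terms of variance, as the plain sum of two basis answers.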

3.2.2 Infinite cases 

Asymptotic procedures 

In some cases a defining function can be found so that
a well defined limit with the number of arguments going
to infinity gives the desired $p(y_q|q, f^0)$. For example,
there are measurements depending on a formally infinite
number of basis questions using some inherent, infinite
parallelism of natural processes or a process having a
well defined limit for the number of involved basis questions
going to infinity. Scattering of waves on structures
with specific translation and rotation invariances
(crystals) creates filters for specific Fourier components which
depend on a (conceptually) infinite number of coordinate
values. Measurement of macroscopic variables in
physics, like energy or magnetization, depends on many
(normally of the order of $10^{23}$) microscopic variables.
If the answers to those questions converge in the limit
of system size $n \to \infty$, they can be considered as a
measurement of an infinite system. In general, a meaningful
statement about an infinite property has to rely on
conditions which can be checked in finite times. In the
following we relate such conditions to the preparation of
an ensemble and possibilities of control.

Measurement and preparation 

We begin with the observation that questions available
during the training situation might be different from
those during the preparation of the ensemble $f$.

Consider objects $f^0$ which, given a parameter $y$, produce
as random output an output distribution with mean
$y$. We can form a population $F^0$ of such objects, note,
i.e. measure, the input parameter $y$ for each member,
and select, with a given (prior) probability $p(f^0) = p(y)$,
one object $f^0$ with unknown $y$. Then we cannot measure
the mean exactly using only training examples $y_i$ of
the output of $f^0$. In cases where the mean exists, the sampling
sum will, according to the law of large numbers (the
analogous statement for the empirical distribution function
is the theorem of Glivenko-Cantelli), usually asymptotically
converge to it (i.e. the probability $\delta$ for the deviation
to be larger than some $\epsilon$ can be $n$-dependently bounded).
Thus, what has been measurable during the preparation
phase by a single measurement (i.e. reading the number $y$)
is, if at all, only asymptotically measurable during training.
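An $n$-dependent bound of this kind can be illustrated with Chebyshev's inequality, which is used here as a stand-in chosen for simplicity and requires a finite variance; the text itself does not name a specific bound. The values `Y_TRUE` and `SIGMA` are hypothetical.

```python
import random

random.seed(3)
Y_TRUE, SIGMA = 0.7, 1.0      # hypothetical hidden parameter y and output noise

def training_example():
    """One training example: noisy output of the selected object."""
    return random.gauss(Y_TRUE, SIGMA)

def sample_mean(n):
    return sum(training_example() for _ in range(n)) / n

def chebyshev_bound(n, eps):
    """Upper bound on P(|sample mean - y| > eps), valid for finite variance."""
    return min(1.0, SIGMA ** 2 / (n * eps ** 2))

eps = 0.1
sizes = [10, 1000, 100000]
bounds = [chebyshev_bound(n, eps) for n in sizes]
estimate = sample_mean(100000)
```

During preparation the number $y$ is read off once, while during training the bound on the deviation only shrinks as $1/n$.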

In a general situation we consider a family of parameterized
answer distributions $F^0$ with known parameters
like expectation and variance. If we can measure the
values of these parameters we can prepare a prior
distribution (state $f$) depending on those parameters. For
example, if a certain symmetry property is measurable
we can choose an $f$ depending on this property, for example
with possible states $f^0$ possessing (or concentrated
around) a certain value of this symmetry. In a specific
training situation, however, the symmetry may not be
directly measurable, and an additional source of noise
may be present.

We will denote the set of questions with available
answers during preparation by $X^p$ and will call it the
preparation set. The preparation set $X^p$ is in general
not equal to the set of training questions $X$ considered
actually available. Questions not finitely or asymptotically
measurable with respect to $X$ can be finitely or
asymptotically measurable with respect to $X^p$.

As far as generalization requires nonlocal questions
not measurable with respect to $X$, like a symmetry or
smoothness constraint for infinite $X$, there has to be
another set $X^p$ to allow the necessary measurements to
prepare the prior. Then the prior can be expressed by a
question $q^p$ depending on a finite number of questions
$x^p \in X^p$ and the corresponding answer $y^p$

$$p(f^0|y^p, q^p) \propto p(f^0)\, p(y^p|q^p, f^0).$$

However, it could not be written using a question
depending on a finite dimensional vector $x \in X$ of training
questions. The training data change the prior according
to Bayes' rule,

$$p(f^0|y, x, y^p, q^p) \propto p(y|x, f^0)\, p(f^0|y^p, q^p).$$

Preparation questions $x^p$ with deterministic answers,
like an exact symmetry, correspond to restrictions of $F^0$.
Probabilistic $q^p$ give probabilistic priors, like a preference
for smooth functions or otherwise symmetric functions
(approximate symmetries).

The distinction between a (momentarily considered)
training set $X$ and a preparation set $X^p$ (one may say
an implicit training set) is more a practical than a formal
aspect. For analysis of the generalization ability, questions
from $X^p$ have to be treated formally equal to those from
$X$. Thus, effectively one deals with training questions
depending on $X \cup X^p$.

Specifically, we will explain below how symmetry and
smoothness can be generated by input noise or averaging.
To recognize or prepare such a situation a measurement
device without (or with less) input noise ($X^p$) must be
available.

We summarize that what appears to be not measurable
with a finite amount of data for $X$ may well be
measurable within the larger set $X \cup X^p$. This is the
case when the set of actually considered training questions
$X$ all share a common noise source, which is absent
for $X^p$.

Measurement and control 

To enable learning we have to assume stationarity of
$p(y|q, f^0)$. Thus, all factors changing the answer
distributions have to be included in $q$ and $f^0$. But the model
does not specify how stationarity is achieved. In practice,
stationarity can result from an active control or just
from not disturbing a constant part of nature.

In general, a measurement $y$ of $q$ in state $f^0$ is the
result of an interaction of the 'active' part of posing a
question $q$ and the 'passive' reaction of nature in state
$f^0$. Usually, measuring a quantity $q$ emphasizes the passive
picture where the value of $y$ reflects a permanent
property of nature $f^0$, however, only seen when measuring
$q$. For preparation questions the complementary
active interpretation of measuring as control or selection
may also be helpful. Then the answer is seen as a
property of nature enforced by the question $q$. In this
interpretation a question is better called a control action
and a measurement device a control device. Thus,
$y$ can be seen as the reaction of nature to $q$. In the general
case of stochastic control different states $f^0$ react differently
to the control action $q$, described by $p(y|q, f^0)$. This
point of view is especially suitable if the variability of $y$
between different $f^0$ is small under $q$. Both interpretations
describe the same formalism, the difference being
the larger emphasis on either the passive or the active
part.



Interpretations may also refer to a more complicated
model of interaction between $q$ and $f^0$. Such more complex
models correspond formally to the introduction of
hidden variables $z$,

$$p(y|q, f^0) = \sum_z p(y|q, f^0, z)\, p(z|q, f^0).$$

That means we think of measurement devices as generalized
questions with respect to some underlying set of
measurable questions. Then the $x$ can be called effective
questions $x^{eff}$ with respect to another, underlying
$X$. If we do not want just to assume such a structure,
we need other measurement devices, we may call them
in analogy to the last Subsection $q^p \in Q_{X^p}$, which allow
us to measure or control a given structure. We will see that
this can be reasonable to assume and those additional
measurement devices will have to be active only a finite
number of times. If we can guarantee the structure for
all $q$, this can be equivalent to dependencies between $q$
and therefore interpreted as nonlocal measurement.

For example one can imagine the value $y$ to be the result of an (ideal) measuring process with a following control device, e.g. a filter describing restrictions of the (real) measurement device. For this we separate the index $q$ into two components $q = (q^{ideal}, q^{filter})$ (or alternatively $f^0 = (f^{0,data}, f^{0,filter})$) and have

$$p(y|q,f^0) = \sum_{y^{ideal}} p(y|q^{filter},y^{ideal})\, p(y^{ideal}|q^{ideal},f^0).$$

We will call this a model of posterior control.

Analogously, we can think of a scenario where first the functions are filtered and then measured, i.e. we split $q$ into $q = (q^{data}, q^{filter})$ (or alternatively $f^0 = (f^{0,data}, f^{0,filter})$) and have

$$p(y|q,f^0) = \sum_{f^0_{filtered}} p(y|q^{data},f^0_{filtered})\, p(f^0_{filtered}|q^{filter},f^0).$$

If we drop $f^0$ from the notation and write again $f^0$ for $f^0_{filtered}$, the filter $q^{filter} = (y^P, q^P)$ creates a prior ensemble. Here the $q$ independence allows us to have it prepared before the training starts. That is what we will call a model of prior control. The situation we discussed in the last Subsection can therefore be realized as prior control. Notice that also under prior control stationarity of the probability distributions must be controlled, and in this sense there is always a posterior control component present. Adjusting the measurement device and repeating the measurement if some 'failure' is indicated is an example of such a posterior control, which is also present if the prior ensemble was prepared in the past.

Now we turn again to the problem of questions depending on an infinite number of basis questions. In the last Subsection we used a decomposition $q' = (q, q^P)$ with $q \in Q_X^{fin}$, $q^P \in Q_{X^P}^{fin}$ and showed that this allows $q^P \notin Q_X^{fin}$. Now we examine how, also in the example of posterior control, measurement of questions depending on an infinite number of basis questions is possible.

The key observation is that the stationarity condition for $p(y|q,f^0)$ only affects an always finite number of measurements. Hence, also (e.g. prior or posterior) control can be restricted to this finite number of measurements. This allows us to understand or implement stationarity as part of the measuring process, e.g. as a filter. For example, nothing prevents us from assuming that asking $q$ the first time causes stationarity for the following times. Therefore, posing a question $q$ can have the interpretation of an active control leading to some restrictions or dependencies for subsequent measurements, stationary for all $f^0$. Differentiation between different $f^0$ is only possible if the controller, but not necessarily the learner, has questions $q^C \in Q^C$ available to distinguish between them.

As an example, take a family of $q$ measurement devices, all capable only of producing answers smaller than $\theta$, independent of $f^0$. This can be ensured by one $q$-independent cutoff device as posterior control. This cutoff device depends itself only on finite dimensional incoming $y^{ideal}$ and has to be active only during the finite number of measurements. Nevertheless, the posterior control model is equivalent to a nonlocal measurement for all $q$, which is possibly an infinite number, with answer $y < \theta$. Here, the control is implemented by $f^0$-independent properties of the measurement device. The same effect appears if all objects under study undergo the same $q$-independent selection. So instead of using restricted measurement devices as posterior control, we could also study restricted objects or situations, i.e. an implementation as prior control. Then the controlling filter acts, using $q^P \in Q_{X^P}$, not on measurement values but on the objects or situations $f^0$ themselves. In analogy to the output bound above, the length of object classes under study might be restricted by $\theta$ because they all arrive in the same box, must fit into a certain environment or are only produced in that way. Those are prior control devices parameterized by $\theta$. Also, if the number of objects is infinite, the prior control device (e.g. the box into which the object must fit) must only be active during the finite number of measurements. 26

26 There can be practical differences between such devices, for example between a 'box' device and a cutoff in the output scale. Indeed, a filter can be implemented as an active or as a passive filter. A passive filter does not always return an output. For values larger than $\theta$, for example, it may answer 'overflow'. For example, using the box device it may take considerable time to find an object which fits into it. There might even be no objects available which fit in the box, and we could need a formally infinite time to produce such an impossible state. An active filter always returns an output. For example we may assume for a cutoff device that it always returns $\theta$ in case the output is larger than $\theta$. However, this has nothing to do with the difference between prior and posterior control. Both variants are also possible for prior control. Consider, for example, an 'active box' cutting everything to a fitting size. Those aspects of differing complexity of single measurements or control actions are not included in the formalism. Yet, nothing prevents us from modeling the micro-structure of single measurements if necessary; we could use the same type of theory, just on another level.

Thus, because control of stationarity, be it related to training or preparation questions, only has to be active at the always finite number of times of actual measurements, there is no practical impossibility in measurements based upon control depending on an infinite number of basis questions. Nothing is actually done an infinite number of times. 27 It just could be done arbitrarily often because we defined the situation to be so. A filter bounding outcome values only has to be active a finite number of times, once for every actual measurement, even if we defined the situation so that it could do so arbitrarily often. This includes also the stationarity conditions for local questions.

We have seen that what exactly choosing question $q$ means in practice, and how stationarity is achieved, be it by leaving nature alone or by active control, does not enter the formalism. Dependency on an infinite number of questions is no practical impossibility insofar as control over stationarity conditions only has to be active at the always finite number of times of actual measurements.
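The point that a bounding device acts only once per actual measurement can be sketched in a few lines. This is an illustration under stated assumptions, not part of the formalism: the 'ideal' answers are drawn from a standard Gaussian purely for concreteness, and the names `active_cutoff`, `passive_cutoff` and `measure` are hypothetical.

```python
import random

def active_cutoff(y, theta):
    """Active filter: always returns an output, replacing y > theta by theta."""
    return min(y, theta)

def passive_cutoff(y, theta):
    """Passive filter: returns None ('overflow') instead of a value above theta."""
    return y if y <= theta else None

def measure(n, theta, filt, seed=0):
    """Apply the filter to n 'ideal' answers (assumed Gaussian for illustration)."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [filt(y, theta) for y in raw]

clipped = measure(1000, 1.0, active_cutoff)
passed = [y for y in measure(1000, 1.0, passive_cutoff) if y is not None]
```

In both variants the filter is called exactly `n` times, yet the resulting bound holds for any question that could have been asked. The active filter returns an answer for every measurement, while the passive one may return fewer.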

Another aspect of control will be discussed in more detail in the next Sections: Often the presence of control is easy to recognize but difficult to formalize. For example, creating a training set for an object recognition task by choosing images as training examples of faces and non-faces or chairs and non-chairs depends on one's implicit definition of the concept of a face or chair, respectively. Drawings may be accepted as valid examples of the object class or not, while very irregular, random-like objects are not selected to represent a face or a chair. Consequently, the definition of such object classes is related to linguistic or implicit concepts representing the objects. Such defining concepts, involved for example in a prior or posterior control process, can correspond to a measurement of an infinite number of objects. Thus, it can be expected to be helpful to have a method to formalize those concepts and include them as prior information in the statistical inference process. Accordingly, useful restrictions might be found from an analysis of the application situations or of the object or situation generating process: a pedestrian detection task might be restricted to pedestrians walking on or near the street within a certain distance, and a car (ship, airplane, ...) detection task may take into account that only certain types of them have been produced up to now.

As humans are always more or less involved in the definition of situations or problems of interest, this shows the clear need for a human interface which enables human knowledge, like the linguistic concepts used to define and control the application situations, to be incorporated into the learning process.

Summarizing, we list two variants in which measurement by control depending on a possibly infinite number of questions can appear:

1. prior control: controlling the process which generates the prior distribution, e.g. control of the situations or objects of interest,

2. posterior control: controlling the measurement values, e.g. restrictions of the measurement devices.

27 This is related to a constructive view of infinity, not attributing 'existence' to an abstract infinite object itself, but to its constructing procedure. This is, for example, the position of Jaynes, 1996, who 'sails under the banner of Gauss, Kronecker, and Poincaré rather than Cantor, Hilbert, and Bourbaki.'

Stationarity is essential for all questions, and the interpretation of measurement as control or active enforcing of stationarity can be applied to all questions, not only to those which cannot be measured using a finite number of basis questions or some asymptotic procedure. One can say that the ability to generalize is based upon the ability to control that the situation of interest and the measuring devices are kept within their restrictions. We have discussed that the probability concept does in general not allow us to exactly verify (even local) nondeterministic conditions using only the training data and no preparation questions. Approaches to test the conditions, i.e. the models $f$, using training data have to refer to meta models or use the classical testing of null hypotheses, i.e. calculate the probability of the data given the condition $f$. Applied to model testing this has also been called the evidence approach (MacKay, 1992c).

We conclude that the ability to generalize is based on the ability to measure or, emphasizing more the active connotation and formally equivalent, to control dependencies between basis questions. We have shown that this is also practically possible in the case of an infinite $X$. From this point of view only measurement can generate new information. Actual learning, however, is not the discovery of something new, but the reformulation of actually imposed and measured conditions to give answers to relevant questions. Analogously, assumed learning is the reformulation of assumed conditions.

4 Priors 

We discuss two examples of generalized questions often used as priors and then give, in the next Section, a general method to construct priors.

4.1 Bounds 

In practice measurement values have a bounded range, and even models using distributions with unbounded range for the variables normally use a bounded range for moments, like the mean. To discuss such bounds we present some variations of questions with $q$-functions calculating the maxima of different sets:

1. empirical maximum from a finite sample

$$m_1 = \max_i y_i(x_i,f^0),$$

2. maximum of the local expectation (regression function)

$$m_2 = \max_x \int dy\, y\, p(y|x,f^0) = \max_x \bar{y}_x(f^0),$$

allowing observed values $y_i > m_2$ generated by noise,

3. outcome (answer) being maximal with probability one

$$m_3 = \max_x y^*(x),$$

defining as local maximum $y^*(x)$ at $x$ the minimal $y$ with

$$\forall y' \in Y_x :\; p(y' > y|x,f^0) = 0,$$

4. maximal potential outcome

$$m_4 = \max_x \max_{y \in Y_x} y,$$

being independent of $f^0$ if the $Y_x$ are defined independent of $f^0$.
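The four maxima listed above can be compared on a small discrete model. The sketch below is illustrative only: the two questions `"a"` and `"b"`, the answer set `Y`, and the table `p` are hypothetical, and for discrete answers $y^*(x)$ reduces to the largest answer with nonzero probability.

```python
import random

# Hypothetical discrete model: p[x][k] = p(y = Y[k] | x, f0) on a common answer set Y.
Y = [0, 1, 2, 3]
p = {
    "a": [0.5, 0.5, 0.0, 0.0],
    "b": [0.0, 0.25, 0.75, 0.0],
}

def empirical_max(sample):
    """m1: largest answer in a finite sample of (x, y) pairs."""
    return max(y for _, y in sample)

def max_regression(p):
    """m2: maximum over x of the local expectation of y."""
    return max(sum(y * w for y, w in zip(Y, row)) for row in p.values())

def max_support(p):
    """m3: maximum over x of the largest answer with nonzero probability."""
    return max(max(y for y, w in zip(Y, row) if w > 0) for row in p.values())

def max_potential():
    """m4: largest element of the answer set, independent of the state."""
    return max(Y)

rng = random.Random(0)
sample = [(x, rng.choices(Y, weights=p[x])[0]) for x in ["a", "b"] * 5]
m1, m2, m3, m4 = empirical_max(sample), max_regression(p), max_support(p), max_potential()
```

In this example `m2`, `m3` and `m4` differ, and any sampled `m1` can exceed `m2` but never `m3`.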

Questions related to $m_1$ belong to $Q_X^{fin}$ and can be empirically measured if $X$ can. In approximation problems where one is interested in modeling the regression function $\bar{y}_x(f^0)$ and not $f^0$ itself one normally refers to $m_2$ and allows, for example, Gaussian noise still to generate arbitrarily large outcomes even for finite $m_2$. The answer to $m_2$ can be bounded only with access to the regression function, which in this case is interpreted as the true underlying function. This 'bounding device' has therefore to be applied before the measurement noise. The bound $m_3$ is the most interesting one in worst case considerations and is itself bounded by $m_4$. Fixing (the answer to question) $m_4$ is done by using cutoff devices. A question $q^{cut}$ corresponding to a cut-off device applied to (real) answers to question $q$ can be written

$$p(y^{cut}|q^{cut},f^0) = \int dy\, \left[\, p(y>A|q,f^0)\,\delta(A-y^{cut}) + p(y<A|q,f^0)\,\delta(y-y^{cut}) \,\right]$$

$$= \int dy\, p(y|q,f^0) \left[\, \delta(\Theta(y-A)-1)\,\delta(A-y^{cut}) + \delta(\Theta(A-y)-1)\,\delta(y-y^{cut}) \,\right].$$

The step function $\Theta(x)$ is defined to give 1 for $x > 0$ and 0 for $x < 0$. If only those cutoff questions are available we can restrict to effective states $f^0_{eff}$ defined by

$$p(y^{cut}|q^{cut},f^0) = p(y|q,f^0_{eff}).$$

We need this device for the next example.
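For discrete answers, the effect of such a cutoff question on an answer distribution can be sketched as follows. The function name `cutoff_distribution` and the example distribution are assumptions made for illustration; the code simply moves all probability mass above the bound onto the bound.

```python
def cutoff_distribution(p_y, bound):
    """Map p(y|q,f0), given as a dict value -> probability, to p(y_cut|q_cut,f0).

    Probability mass above `bound` is moved onto `bound`; the rest is unchanged.
    """
    p_cut = {}
    for y, w in p_y.items():
        y_cut = bound if y > bound else y
        p_cut[y_cut] = p_cut.get(y_cut, 0.0) + w
    return p_cut

p_y = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}   # hypothetical answer distribution
p_cut = cutoff_distribution(p_y, bound=1)
```

The result is again a normalized distribution, so a learner who only has access to cutoff questions cannot distinguish the original state from an effective state that never produces answers above the bound.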

4.2 Approximate invariances: Smoothness and 
symmetry 

Smoothness and symmetries are probably the most important and most often used nonlocal priors. Here we want to show how restrictions of the measurement device (or equivalently restrictions in the situation (object) generating device) can lead to bounds on smoothness and symmetries.

Let us assume a group $S$ represented by operations $s_x$ acting on $x$. For simplicity we skip the index $x$ if not necessary and write simply $s$. Then for scalar states $f^0$ the action of operation $s$ is defined by requiring invariance $p(y|x,f^0) \stackrel{!}{=} p(y|sx,s^{-1}f^0)$, that is

$$p(y|x,sf^0) = p(y|s^{-1}x,f^0).$$

Generalization to states with components not invariant under $s$, like for example vector states, is straightforward,

$$p(y|x,sf^0) = p(s_y y|s^{-1}x,f^0),$$

where $s_y$ is the action of $s$ in the $y$-representation of the group, which may be different from the $x$-representation.
Measuring invariance of the regression function under a set of operations $S$ can for example be done by calculating a mean square error (weighted with $w(x,s)$ if necessary), writing $E(y(f^0,x)) = \bar{y}_x$,

$$d_S^2(f^0) = \int dx \int ds\, w(x,s)\, (\bar{y}_x - \bar{y}_{sx})^2,$$

where $s$ denotes also the index of operator $s \in S$ and the integral notation valid for continuous (Lie) groups has to be replaced by a sum for finite groups. (See for example Ferraro, 1992, for Lie groups in pattern recognition.) A distance $d_S$ can be used to construct priors, for example like $p(f^0) \propto e^{-c\, d_S^2(f^0)}$, if normalisable in $F^0$, or $p(f^0) = \Theta(d_{max}^2 - d_S^2(f^0))$. A relative difference is obtained using $w(x,s) = (1/(x - sx))^2$. One could also measure higher order differences like $((\bar{y}_x - \bar{y}_{sx}) - (\bar{y}_{sx} - \bar{y}_{s^2x}))^2$, corresponding in the infinitesimal case to second derivatives. Measuring smoothness is the special case of measuring infinitesimal translational invariance $s_0$. For example, written for the one-dimensional case with $w(x,s) = \delta(s - s_0)\,\frac{1}{(x - sx)^2}$,

$$d_S^2(f^0) = \int dx \left( \frac{\partial \bar{y}_x}{\partial x} \right)^2.$$

The $x$-integration $\int p(x)\,dx$ can be written in the form $\sum_{i \in orbits} \int p(s'x_0^i)\,ds'$, with the orbit $i$ of $x^i$ defined as $Sx^i$ and the sum restricted to one $x^i$ per orbit. If there is only one orbit, all $x$ can be generated as $x = s_x x_0$ out of one $x_0$ by repeated applications of the infinitesimal $s_0$, with $s_x = e^{x s_0}$ denoting the corresponding finite transformations.

For discrete symmetries integrals become sums. Ex- 
amples of discrete symmetries include permutation of 
components in case x is a vector. Function spaces of 
functions depending on vector arguments x can be con- 
structed as tensor product of function spaces depend- 
ing on the single components of x. If every compo- 
nent of x corresponds to a measurement of another ob- 
ject, exact permutation symmetry means indistinguish- 
able objects. 28 
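For finite groups the distance measuring deviation from invariance becomes a double sum, which is easy to evaluate on a grid. The following sketch assumes a regression function given as a list of values on `n` grid points and uses two hypothetical operations, a reflection and a cyclic shift; the names are illustrative only.

```python
def symmetry_distance(y_bar, operations, weight=lambda x, k: 1.0):
    """Discrete analogue of d_S^2: sum over x and s of w(x,s) (y_x - y_{s(x)})^2."""
    total = 0.0
    for x in range(len(y_bar)):
        for k, s in enumerate(operations):
            total += weight(x, k) * (y_bar[x] - y_bar[s(x)]) ** 2
    return total

n = 6
reflect = lambda x: n - 1 - x            # mirror symmetry of the grid
shift = lambda x: (x + 1) % n            # cyclic translation by one site

symmetric = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
constant = [2.0] * n

d_reflect = symmetry_distance(symmetric, [reflect])
d_shift = symmetry_distance(symmetric, [shift])
d_const = symmetry_distance(constant, [reflect, shift])
```

The mirror-symmetric function has zero distance under reflection but a positive one under translation, while the constant function is invariant under both. A prior of the form $e^{-c\,d}$ would therefore prefer different functions depending on which operations are included.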

Now, let us assume that there is input noise or averaging associated with operations $s \in S$ of some group $S$. In spatial systems this is often also called coarse graining (see for example Balian, 1991, Goldenfeld, 1992). This is a very natural assumption for smoothness or infinitesimal translational invariance, as no real measurement device has infinite resolution. If only questions with input noise with respect to $S$ are available we can define an effective state $f^0_{eff}$. Including the identity into the set $S$ it is defined by

$$p(y|x_{eff},f^0) = p(y|x,f^0_{eff}) = \int ds\, p(s|x)\, p(y|sx,f^0),$$

with the input noise characterized by $p(s|x)$, that is the probability of posing question $sx$ instead of question $x$.

28 For example, in physics identical particles like bosons are related to an exact permutation symmetry.

The following is an example of averaging with respect to $S$ by the same weight factor $p(s|x)$:

$$p(y|x_{eff},f^0) = p(y|x,f^0_{eff}) = \int \Big( \prod_s dy_s\, p(y_s|sx,f^0) \Big)\, \delta\Big( \sum_s p(s|x)\, y_s - y \Big).$$

Now, we are interested in bounds on the approximate symmetry of the effective regression functions

$$\bar{y}_x(f^0_{eff}) = E(y(f^0_{eff},x)) = \int dy\, y\, p(y|x,f^0_{eff}) = \int dy \int ds\, y\, p(s|x)\, p(y|sx,f^0)$$

$$= \int ds\, p(s|x)\, E(y(f^0,sx)) = \int ds\, p(s|x)\, \bar{y}_{sx},$$

for the input noise version. The average versions give the same expectation but smaller variance. The result is a convex combination of the local averages $\bar{y}_{sx}$. Now we consider a measurement device with finite range (for its real valued components) of output and define according to the last section a new effective state which includes the cutoff with respect to an upper bound $A_x$ and lower bound $B_x$. Then the effective regression function can only take values between the extremal points $A_x = \min_s E(y(f^0,sx))$ and $B_x = \max_s E(y(f^0,sx))$, and by changing $p(s|x)$ we can obtain any value in between.

Analogously, differences are bounded. If we take for simplicity $p(s|x) = p(s|x') = p(s)$, we have for $x' = sx$

$$|\bar{y}^{eff}_x - \bar{y}^{eff}_{x'}| = |E(y(f^0_{eff},x)) - E(y(f^0_{eff},x'))| = \Big| \int ds'\, \big( p(s') - p(s's^{-1}) \big)\, \bar{y}_{s'x} \Big|$$

for a parameterization $s'$ with $\int ds' = 1$. Therefore, changing $p(s)$ allows us to obtain any bound $d_{eff}$ for the norm of the difference between $d^{max}_{eff} = \max_x \max((A_x - B_x),(B_x - A_x))$ and $d^{min}_{eff} = 0$ (for $p(s) =$ const.), and we can achieve

$$d_S^2(f^0_{eff}) = d_S^2(f^0_{eff})\, \Theta\big( c\, d_{eff}^2 - d_S^2(f^0_{eff}) \big),$$

with $\int dx \int ds\, w(x,s) = c$. This bounds the smoothness or deviation from perfect symmetry by $0 \le d_S^2 \le c\, d_{eff}^2$ for any $0 = d^{min}_{eff} \le d_{eff} \le d^{max}_{eff}$.

Thus, input noise or averaging in connection with a 
cutoff device can lead to dependencies between answers 
of an effective function by bounding their maximal dif- 
ferences, which is a symmetry or smoothness property. 29 
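The combined effect of a cutoff and input noise can be checked numerically on a small cyclic grid. The sketch below is a hedged illustration: the raw values, the bounds of $\pm 1$, and the three choices of $p(s)$ are hypothetical, and cyclic shifts stand in for the group $S$.

```python
def effective_regression(y_bar, p_s):
    """Average y_bar over cyclic shifts s with weights p_s[s] (input noise)."""
    n = len(y_bar)
    return [sum(p_s[s] * y_bar[(x + s) % n] for s in range(len(p_s)))
            for x in range(n)]

def max_neighbour_difference(y):
    """Largest absolute difference between cyclic neighbours."""
    n = len(y)
    return max(abs(y[x] - y[(x + 1) % n]) for x in range(n))

def clip(y, lower, upper):
    """Cutoff device restricting all values to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in y]

raw = [0.0, 5.0, -3.0, 1.0, 4.0, -2.0]          # hypothetical regression values
bounded = clip(raw, -1.0, 1.0)                   # cutoff with bounds -1 and 1

sharp = effective_regression(bounded, [1.0])               # no input noise
uniform = effective_regression(bounded, [1.0 / 6] * 6)     # p(s) = const.
partial = effective_regression(bounded, [0.5, 0.5])        # two-point input noise
```

Without input noise the neighbour difference can reach the full range between the bounds. With constant $p(s)$ the effective function is constant and the difference vanishes, and intermediate choices of $p(s)$ give intermediate bounds.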



29 For example, the support vector machine (Vapnik, 1995) applied to classification problems can be seen as such an approach. Here the input space is embedded in a (often much) higher dimensional feature space. The classification in feature space only requires the calculation of scalar products which are defined through a positive definite kernel, chosen so that it is easy to calculate. In feature space a linear separating hyperplane is constructed maximizing its distance to the nearest data points, which are also called support vectors. Choosing the separating hyperplane with maximal distance to the nearest sample points is equivalent to maximizing the input noise around the sample points without changing the classification. The cutoff consists in the restriction to data within a certain radius. One may interpret the class boundaries as resulting from equal class priors and a class membership probability of the form of a mixture of Gaussians centered at the data points, radially symmetric in feature space with respect to the distance induced by the selected kernel, and with equal variance $\sigma$. From this point of view one can say that the support vector machine obtains a solution with a maximal 'smoothness' of the class membership probability with respect to the kernel induced distance in feature space, i.e. with maximal $\sigma$. Thus, the support vector machine implements a smoothness prior relative to the feature space. One may remark here that the related VC dimension of the support vector machine can (up to now) not be calculated exactly, because the related function space $F$ of optimal hyperplanes is not defined a priori but depends on the $x$-values of the training data. (See Shawe-Taylor, Bartlett, Williamson, Anthony, 1996ab and their concept of a luckiness function.)

For sampling of symmetry priors by virtual examples and references see Section 8.4.

In real measurement devices this combination of cutoff with input noise or averaging (coarse graining) can occur on many different levels, and it is an interesting question whether the omnipresent smoothness phenomena in nature could be partly explained in that way.

5 Subjective priors

5.1 General priors: How probabilistic models are obtained

The preparation of an ensemble $F^0$ with a certain prior distribution requires measurement or control of certain properties of $f^0$. States $f^0$ are defined in terms of a parameterization of $p(y|x,f^0)$. Any prior $p(f^0|f^P)$ describing a state of knowledge (i.e. state of preparation) is therefore a deterministic functional of those $p(y|x,f^0)$, and also of their parameterization. Thus the set of $p(y|q,f^0)$ itself are answers to the maximal set $X^P_{max}$. The $x^P_{max}$ are the deterministic functions giving $p(y|x,f^0)$. This means the number $p(y|x,f^0)$ for given $x$, $y$, $f^0$, not the probabilistic questions $x$ giving answer $y$. The preparation questions $X^P_{max}$ allow us to construct a set of well defined $p(y|x,f^0)$ producing devices. Usually then one of these devices is selected according to some $p(f^0)$, and one has to find out using new data $D$ which one is actually chosen. With respect to $X^P_{max}$ every prior, even if defined directly by a deterministic functional of the $p(y|x,f^0)$, can be reinterpreted (but not in a unique way) as resulting from a uniform prior $p(f^0)$ with additional given data $(y^P, q^P)$. We take the point of view that every nonuniform prior is caused by such data $D$, sometimes also denoted by $D^0$ if we want to distinguish them from other training data $D$.

Given data $D$ the corresponding state $p(f^0|D)$ can be calculated if $p(D|f^0)$ is known. The $p(D|f^0)$ are part of the structural knowledge. They must be determined independently of the actual task under consideration. Thus, their knowledge is always a transfer of knowledge from another task and assumes constancy of these distributions under transfer. Up to now we assumed the $p(D|f^0)$ to be given. Then the preparation of an ensemble $p(f^0)$ can be related to measurement devices with already known (empirically measured) answer probabilities under the various states. Here we discuss the measurement of $p(D|f^0)$. The main problem is that we have to measure $p(D|f^0)$ in another situation, and therefore have to ensure its constancy (or make its approximate constancy plausible) to allow transfer.

Measurement or control 'devices' include humans, e.g. an ensemble is prepared under human control or described by verbal statements. So, images of a training set may be labeled 'chair' or 'non-chair' by some 'expert in chairs'. To allow any meaningful generalization we must at least approximately (maybe implicitly) transform the expert's concept of a chair into a chair approximator, applicable to all possible images, which can then be improved by training examples. Of course, one may rely on some smoothness condition or other implicit model restrictions (corresponding to a special chair approximator for all possible images) and the training examples alone. But obviously, besides the training examples, any verbal description of a chair also adds to the available information, and one may have good reasons to believe that there is a better 'universal first guess chair approximator' than just using the implemented smoothness and other implicit model restrictions. Even if many training examples are available, it can help to use verbal descriptions to create also 'virtual' examples (which need not be an infinite set, assuming that the implicitly or explicitly implemented smoothness conditions interpolate in the neighborhood), because the virtual examples can include data which are not available as training data. They may teach the concept 'chair' far more effectively than images of real chairs. Thus, we want to find a reliable relation of answer distributions of experts to possible states $f^0$, i.e. approximate the expert answer probability $p(y^E|q^E,f^0)$. This is obviously a very complicated task and a subject of psychological research as well as of statistics.

More generally, one may even take the point of view that human experts are always involved: They have to describe 1. the single data probabilities $p(D|f^0)$ in an already more precise or still less precise verbal form, and 2. the dependency of the prior on the various data. E.g. single data have to be combined by AND, OR, or more complicated operations. These operations depend on the dependency structure of the single data. One must find a procedure to translate the related verbal information into numbers. This may be rather trivial if the verbal statement is a reference to a (maybe empirically obtained) number. However, often this is not quite as easy. Assume we have data $(q, T_y)$ available from some approximator $T$ of $p(y|q,f^0)$. We might know that it has been trained on some examples. But what would be $p(T|f^0)$ and $p(y_q|T)$? This clearly depends on the learning algorithm the approximator was using, as well as on $f^0$. We may not know the details of the algorithm, and determining $p(T|f^0)$ consistently with our model of basis questions may well be a much harder task than solving our actual problem. Thus, we have to perform an approximate 'internal integration over (hyper-)priors' by considering our experience with this and similar approximators in similar situations. The result will be a verbal statement describing a subjective concept related to $p(T|f^0)$, e.g.: 'We may trust the results of this simple approximator, not too much but a little, in all situations which are not too similar to the examples A or B'. This has to be changed into a numerical representation, which includes e.g. translation of 'not too much', 'A', 'B', 'similar to A, B', IF ... THEN, OR, AND. Still one can expect this to be in general a better solution than just ignoring imprecise information: Generalization requires implementation of nonlocal dependencies, and trying to match those nonlocal dependencies to imprecise verbal descriptions seems better than using nonlocal dependencies which are implicitly implemented without any further reasoning. Of course, there may be individual situations where the unknown implicit assumptions are better suited to the problem. But if this occurs regularly, the specific method used to include imprecise information has to be, and can be, adapted.

In principle, every prior could be seen as resulting from one combined prior question, and therefore be constructed like any generalized question. However, especially for priors, but possibly for every question, the necessary ingredients have to be constructed from more or less imprecise information in verbal form. We therefore briefly comment on the standard operations AND and OR for probabilities from this point of view. Consider a prior, prepared by applying preparation question $q^P$ and enforcing or measuring an empirical answer $y^P$. One obtains

$$p(f^0|f^P) \propto p(f^0)\, p(y^P|q^P,f^0).$$

According to the construction of generalized questions we can translate our verbal statement that the data $(q_i^P, y_i^P)$ should be combined by an AND for conditionally independent events by forming the product

$$p(f^0|f^P) \propto p(f^0) \prod_i p(y_i^P|q_i^P,f^0).$$

But we might not be too sure about their independence, without knowing any explicit dependency structure. Again we may delegate the task to an expert and translate the expert's verbal output either 1. into conditional probabilities determining the dependency between the $y^P$, and apply the correct probability theoretical formula, or 2. directly into an adapted combination of the $p(y_i^P|q_i^P,f^0)$, which might well look different from the above formula for independent events.

Similarly, a partial sum (or integration for densities) corresponding to a (weighted) OR would translate statements referring to measurement noise for disjunct events:

$$p(f^0|f^P) \propto p(f^0) \sum_i p_i^P\, p(y_i^P|q^P,f^0),$$

$$p(f^0|f^P) \propto p(f^0) \sum_i p_i^P\, p(y_i^P|q_i^P,f^0).$$

The following statement 'The result was either 1 or 7, I am not sure.' would fit quite well into that category, provided we can relate the $p_i^P$ to subjective beliefs. Again, instead of guessing $p_i^P$ and the dependency structure and applying the correct probability theoretical formula, one may translate a verbal statement directly into $p(f^0|f^P)$, and the resulting formula could look different.
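The difference between the AND and OR combinations can be made concrete for a finite set of states. The following sketch is illustrative only: the two states, the likelihood values, and the equal beliefs are hypothetical numbers, and the functions `posterior_and` and `posterior_or` are names introduced here.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def posterior_and(prior, likelihoods):
    """p(f|data) proportional to p(f) * prod_i p(y_i|q_i,f) (conditionally independent data)."""
    post = list(prior)
    for lik in likelihoods:
        post = [p * l for p, l in zip(post, lik)]
    return normalize(post)

def posterior_or(prior, likelihoods, beliefs):
    """p(f|data) proportional to p(f) * sum_i beliefs[i] * p(y_i|q_i,f) (disjoint alternatives)."""
    mix = [sum(b * lik[k] for b, lik in zip(beliefs, likelihoods))
           for k in range(len(prior))]
    return normalize([p * m for p, m in zip(prior, mix)])

prior = [0.5, 0.5]              # two hypothetical states f_1, f_2
lik_1 = [0.8, 0.2]              # p(y = 1 | q, f_k)
lik_7 = [0.1, 0.4]              # p(y = 7 | q, f_k)

both = posterior_and(prior, [lik_1, lik_7])                        # 'saw 1 AND, separately, 7'
either = posterior_or(prior, [lik_1, lik_7], beliefs=[0.5, 0.5])   # 'either 1 or 7'
```

With these numbers the AND combination leaves both states equally probable, while the OR combination favours the first state with probability 0.6. The same verbal data thus lead to different priors depending on how the statements are combined.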

In general, instead of constructing the ingredients from verbal statements and then inserting them in probability theoretical formulas, one may also directly construct the result from imprecise knowledge. The results will differ if the subjective concepts do not represent probabilities. This is the case if the dependency structure is more easily formulated in a verbal form than in terms of conditional probabilities. On the other hand, one may try to improve verbal statements by requiring experts to use probabilistic models. In all cases we start with imprecise verbal information and require a probabilistic model at the end. The transition can be done on various levels.

5.2 Fuzzy priors 

5.2.1 Human control and subjective 
probabilities 

Let us therefore consider in more detail the case where preparation and problem definition are under human control and expert knowledge about the problem domain is available, for example in a verbal form instead of in terms of a probabilistic model. Those situations rely more on a subjective than on the empirical interpretation of probability. 'Subjective probability' may simply mean the answer of an expert who has been asked to give a probability. But those guesses do not need to be very accurate. In cases where empirical probabilities are available one can compare subjective probabilities obtained by different methods with empirical probabilities. Typical tendencies of deviations of subjective estimates from empirical probabilities have been studied (Tversky, 1972; Tversky & Kahneman, 1981; Kahneman & Tversky, 1979, 1982abc; Kahneman, Slovic, Tversky, 1983. See Lemm, 1984, for an example related to information costs in decisions). They include, besides many others, overestimation of small probabilities or of probabilities for easily retrievable, salient events, neglecting the base rate of events, the sample size or correlations, and a tendency towards a chosen reference point, like underestimation for the sum of probabilities and overestimation for joint probabilities. Subjective estimates of probabilities do usually not obey the rules of probability theory; for example, independently obtained estimates of probabilities for events A and NOT A need not necessarily sum up to one. In addition to the difficulties caused by the deviations between subjective estimates and empirical probabilities, it is often not obvious how to describe an event A to which a subjective estimate is related. A subjective estimate for probability $p(A)$ refers to an internal representation of A. To relate the estimate to a probability for some external A, the internal representation of A (e.g. concept of a mouth) must be related to the external event (e.g. images of mouths). That means A has to be identified with a set of $f^0$, described by an explicit parameterization.

The process of obtaining subjective probabilities may be outlined as
follows: We aim at producing a guess for the probability $p(f^0)$. Let
us call any deterministic question $C(f^0)$ (i.e. function or
functional, respectively) of the parameters of $f^0$ a property of
$f^0$, and $\tilde C$, its subjective representation, a concept. We
want to construct a final property $C(f^0)$ which is related to
$p(f^0)$ by some function $g$, i.e. $p(f^0) = g(C(f^0))$. For example
$p(f^0) \propto C(f^0)$, or $p(f^0) \propto e^{C(f^0)}$. The function
$g$ could also be used to compensate for known estimation biases. The
subjective representation $\tilde C$ used to produce subjective
probabilities might be difficult to relate directly to a specific
function $C(f^0)$ of the parameters of $f^0$. It can be easier for
other, simpler concepts $\tilde C_i$. For example, the concept
$\tilde C$ of having something similar to a nose, a mouth and two eyes
with certain possible spatial relations might be difficult to relate
to pixel values directly. But for a simple enough concept
$\tilde C_i$ a property $C_i$ might be more accurately related.

A property $C_i$ measuring distance can be built out of a property
$C'_i$ and a related template $T_i$, using a monotonic function of a
'meaningful' distance of the expectation,

$$C_i = \|C'_i(f^0) - T_i\|^2.$$

$C'_i$ could be an arbitrary question. For $C'_i(f^0) = y_x(f^0)$ this
measures the square distance of $y_x(f^0)$ at point $x$ from some
reference template $y_x^T$, i.e. $\|y_x(f^0) - y_x^T\|^2$, and
represents a usual mean square error term. Templates could also be
defined relative to another question,

$$\|C_i(f^0) - C_{T(i)}(f^0)\|,$$

like in the case of symmetries, but also in cases where not invariance
but any arbitrary dependence between $C_i$ and $C_{T(i)}$ is measured.

We will write $C_i = G(\tilde C_i)$ for the relation between concepts
and properties. Properties $C_i$ related to (linguistic) concepts
$\tilde C_i$ have been called linguistic variables and have been used
in the theory of fuzzy sets (see for example the collections Zadeh,
1987, or the more recent one, Zadeh, 1996). Subconcepts are modified
and combined by concept functions $\tilde F$ to form the final concept
$\tilde C$. For example, we may require: 'great similarity to $T_1$
and $T_2$ if not already very similar to $T_3$'. In analogy to
concepts, also concept functions $\tilde F_i$ must be mapped to
functions $F_i = H(\tilde F_i)$ acting on the $C_i$. We will call
functions $F$ related to concept functions $\tilde F$ linguistic
functions. The mappings $G$ for concepts and $H$ for concept functions
represent the subsymbolic level. The communication of the (symbolic)
structure of $\tilde C$ in terms of concept variables and concept
functions, i.e. $\tilde C = \tilde F(\{\tilde C_i\})$, is used for the
approximation

$$C = F(\{C_i\}) = H(\tilde F)(\{G(\tilde C_i)\}),$$

where $H(\tilde F)$ stands for the function $\tilde F$ with all
included subfunctions replaced according to $F_j = H(\tilde F_j)$ (see
Fig. 3). We try to achieve

$$G(\tilde C) = G(\tilde F(\{\tilde C_i\})) \approx F(\{C_i\}) = H(\tilde F)(\{G(\tilde C_i)\}).$$

However, this is difficult to check in general if $G(\tilde C)$ cannot
be obtained consistently in a direct way. Indeed, this difficulty is
the reason for the decomposition into subconcepts. One can instead
define $C$ by the right hand side and check the variability of the
dependencies on those properties under transfer to new situations.
Thus, this can be seen as a heuristic method to achieve dependencies
with smaller variance under transfer.

Figure 3: The relation between (internal) concepts and (external)
properties. For complex concepts $\tilde C$ the relation to the
property $C$ is often easier to obtain and more invariant under
transfer to new situations, if the mapping from concepts to properties
(subsymbolic mapping for linguistic variables) is done for simpler
subconcepts $\tilde C_i$. This, however, also requires a (subsymbolic)
mapping of concept functions $\tilde F$, acting on concepts
$\tilde C_i$, to property (external, linguistic) functions $F$, acting
on properties $C_i$. Then, the property $C$ for a communicated
symbolic structure $\tilde C = \tilde F(\{\tilde C_i\})$ can be
approximated by $C = F(\{C_i\})$.

We will call priors obtained by this method fuzzy priors. In the
following we do not intend to give an introduction to fuzzy logic. We
mainly want to stress that related techniques can be adapted to
construct priors, and the more technical point that this can lead to
nonlinearities in the equations determining the maximum of the
probability $p(f^0)$ even if the single $C_i$ correspond to Gaussian
probabilities.

The method can be applied in two variants. The first way of defining
$p(f^0|f^p)$ is to describe the functional directly in its dependence
on the parameters, without explicit reference to data. A second
possibility is to define in a first step $p(y^p|q^p, f^0)$ in its
complete dependence on all three variables and then to choose in a
second step $y^p$ to fix the prior. We will mainly study the first
possibility, and we will study linguistic functions
$C_C = F(C_A, C_B)$ depending on two one-dimensional properties. Their
combination gives functions depending on more than two variables.



5.2.2 Real valued extensions of logic 

Linguistic functions can be constructed by fixing their values at
certain typical points and using some (e.g. smooth, symmetric)
interpolation scheme. For example, practically important linguistic
functions are those related to logical functions like AND, OR or NOT
used in fuzzy logic (see e.g. Kandel, 1982, Klir & Yuan, 1995). For
such functions we assume that they coincide with the corresponding
binary logical functions at the four corners, i.e. where both
arguments have a minimal or maximal value. We will mainly concentrate
on these functions but also model combinations which do not correspond
to binary logical functions at the four corner points.

For example, we may assume high probability for a function $f^0$ if it
is smooth at $x_1$ AND $x_2$ AND $\cdots$. In the second variant of
the method we begin with the construction of answer distributions
$p(y|q, f^0)$, and a prior results by choosing data $(y_i^p, q_i^p)$.
There, for example in face recognition, $q^p$ can be defined as a
question looking for components like eyes in the images $y$ produced
by $p(y|x = \text{face}, f^0)$, and the $y^p$ can be examples of what
eyes can look like. Requiring that $f^0$ produces (all of) several
variants $V_i$ of eyes, i.e. its output is $V_1$ OR $V_2$ OR
$\cdots$, means ANDing for constructing a prior, so that $f^0$ can
produce $V_1$ AND $V_2$ AND $\cdots$.

Consider properties with $0 \le C_i \le m$, where we may allow the
limit $m \to \infty$. We use here a 'distance interpretation' of the
values 0 and $m$, where we interpret the value $m$ as complete absence
('far' or 'False') and the value zero as complete presence ('near' or
'True') of that property (deviation property). Monotonic functions of
(bounded) distances to some templates are examples. The interpretation
of 0 and $m$ can be reversed (similarity property), like in the usual
convention in logic or for properties like probabilities, where 0
means 'False' or 'impossible' and $m = 1$ 'True' or 'sure'. Then, just
the definitions of AND and OR in the following have to be exchanged.

The following are examples of functions equal to a binary logical
function when the arguments have values 0 or $m$:

$$\begin{aligned}
C_{(A\ \mathrm{AND}\ B)} &= C_A + C_B - C_A C_B/m, \\
C_{(A\ \mathrm{OR}\ B)} &= C_A C_B/m, \\
C_{(\mathrm{NOT}\ A)} &= m - C_A.
\end{aligned} \qquad (5)$$

These operations represent a Boolean algebra (see for example
Whitesitt, 1995), and therefore, for variables taking only the values
0 and $m$, for example DeMorgan's law
A OR B = NOT ((NOT A) AND (NOT B)) is valid.

The functions are extended to real variables by allowing values
$0 \le C \le m$ in the above formulas. Of course, this extension to
real values cannot be unique. So AND can also be implemented as

$$C_{(A\ \mathrm{AND}\ B)} = c_m(C_A + C_B),$$

where $c_m$ is a cutoff function with $c_m(x) = x$ for
$0 \le x \le m$ and $c_m(x) = m$ for $x > m$, and therefore always
$c_m(C_A) = C_A$. With this, DeMorgan's law gives for OR

$$C_{(A\ \mathrm{OR}\ B)} = m - c_m(2m - C_A - C_B).$$




Figure 4: Different realizations of AND and OR, in a 'distance
interpretation' where 0 stands for 'True' ('near') and 1 for 'False'
('far'). $C_A$, $C_B$ are abbreviated $A$, $B$ and the cutoff function
$c_1$ as $c$.

Figure 5: Examples for real valued extensions of tautologies (panels
include $A(1-A)B(1-B)$). The last one has the formula
$1 - (1 - A^2(1-A)^3 + 1 - B^2(1-B) - (1 - A^2(1-A)^3)(1 - B^2(1-B)))^{100}$.



Not differentiable near our main point of interest at (0,0), where the
'good' $f^0$ should be located, but $m$-independent, are the following
realizations:

$$C_{(A\ \mathrm{AND}\ B)} = \max(C_A, C_B), \qquad
C_{(A\ \mathrm{OR}\ B)} = \min(C_A, C_B).$$

This representation of AND and OR by the maximum and minimum
operations, respectively, is the standard choice in fuzzy logic.
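
The three realizations discussed so far (the product form of Eq. (5),
the cutoff form, and the max/min form) can be compared numerically.
The following is only a minimal sketch in Python; the function names
and the dictionary REALIZATIONS are illustrative choices, not notation
from the text. It checks that, in the distance interpretation (zero
means 'True', $m$ means 'False'), all three agree with Boolean logic
and with DeMorgan's law at the four corners, while they differ in the
interior.

```python
import itertools


def cutoff(x, m):
    """Cutoff function c_m: identity on [0, m], constant m for x > m."""
    return min(x, m)


def not_(a, m):
    return m - a


# Product-type realization, Eq. (5)
def and_product(a, b, m):
    return a + b - a * b / m


def or_product(a, b, m):
    return a * b / m


# Cutoff realization; the OR follows from DeMorgan's law
def and_cutoff(a, b, m):
    return cutoff(a + b, m)


def or_cutoff(a, b, m):
    return m - cutoff(2 * m - a - b, m)


# m-independent max/min realization (m is accepted but unused)
def and_max(a, b, m):
    return max(a, b)


def or_min(a, b, m):
    return min(a, b)


REALIZATIONS = {
    "product": (and_product, or_product),
    "cutoff": (and_cutoff, or_cutoff),
    "maxmin": (and_max, or_min),
}


def binary_limit_ok(m=1.0):
    """Check every realization against Boolean logic at the four corners."""
    for a, b in itertools.product((0.0, m), repeat=2):
        a_true, b_true = (a == 0.0), (b == 0.0)  # zero means 'True'
        for and_f, or_f in REALIZATIONS.values():
            if (and_f(a, b, m) == 0.0) != (a_true and b_true):
                return False
            if (or_f(a, b, m) == 0.0) != (a_true or b_true):
                return False
            # DeMorgan: A OR B = NOT((NOT A) AND (NOT B))
            if or_f(a, b, m) != not_(and_f(not_(a, m), not_(b, m), m), m):
                return False
    return True
```

Away from the corners the realizations disagree, e.g. for
$C_A = C_B = 0.5$ and $m = 1$ the three ANDs give 0.75, 1.0 and 0.5,
which illustrates that the real valued extension is not unique.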

In the binary limit a variable $C_A$ can always be written as
$C_A = C_A^2/m$, $C_A = 2C_A - C_A^2/m$, or $C_A = c_m(C_A)$. Any
(usually monotonic) interpolating function $\sigma(x)$ with
$\sigma(0) = 0$ and $\sigma(m) = m$ allows us to replace $A$ by
$\sigma(A)$ without changing the limit of binary logic. For example,
OR could be defined as $\sigma_{OR}(\sigma_A(C_A)\sigma_B(C_B)/m)$,
choosing some (monotonic) functions $\sigma_{OR}$, $\sigma_A$,
$\sigma_B$. Rules like DeMorgan's law are valid for variables with
values 0 and $m$ but not in general for real values. Fig. 4 shows two
real valued extensions of AND and OR and Fig. 5 some tautologies which
can be added to every function without changing the binary limit. Of
course, any real valued extension of logical functions is arbitrary
except at the four corners. Thus, one might add additional conditions
for such functions, like monotonicity or smoothness conditions, or the
requirement that certain laws of the Boolean algebra, valid in the
binary limit, also hold for real values, requiring for example an
associative, commutative, and distributive AND and OR. Besides having
the correct boundary conditions, usually monotonicity, commutativity,
and associativity are required for fuzzy operations. (Such operations
are called t-norm or t-conorm for boundary conditions corresponding to
AND or OR.)

5.2.3 General combinations 

Every logical formula can be expressed by NOT, OR, AND. In particular,
one could use the disjunctive or conjunctive normal form. Also, AND or
OR could be eliminated using DeMorgan's law. For example, the
exclusive OR is defined as XOR = (A AND NOT B) OR (NOT A AND B). For
disjunct events XOR is equivalent to OR.
Accordingly, real extensions of the exclusive OR are

$$C_{XOR} = \frac{1}{m}\, c_m(C_A + m - C_B)\, c_m(m - C_A + C_B),$$

$$C_{XOR} = \frac{2 C_A C_B}{m} - C_A - C_B + m.$$

The latter result can be connected to Eqs. (5) by using
$C_i^2/m = C_i$. Fig. 6 shows some possible variations of XOR.
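
As a quick numerical illustration of the XOR extensions, the sketch
below (Python; all function names are illustrative) evaluates the
cutoff form, the polynomial form, and the form obtained by inserting
the operations of Eq. (5) into (A AND NOT B) OR (NOT A AND B). It
assumes the distance interpretation, where zero means 'True'. The
three forms coincide at the four corners; the polynomial form equals
the Eq. (5) form only after using $C_i^2/m = C_i$, so they differ in
the interior.

```python
import itertools


def cutoff(x, m):
    """Cutoff function c_m: identity on [0, m], constant m above."""
    return min(x, m)


def xor_cutoff(a, b, m):
    """XOR built from the cutoff AND and the product OR."""
    return cutoff(a + m - b, m) * cutoff(m - a + b, m) / m


def xor_polynomial(a, b, m):
    """Polynomial XOR, 2 C_A C_B / m - C_A - C_B + m."""
    return 2 * a * b / m - a - b + m


def xor_from_eq5(a, b, m):
    """(A AND NOT B) OR (NOT A AND B) with the operations of Eq. (5)."""
    a_and_not_b = a + (m - b) - a * (m - b) / m
    not_a_and_b = (m - a) + b - (m - a) * b / m
    return a_and_not_b * not_a_and_b / m


def xor_corner_table(m):
    """Values of the three XOR forms at the four binary corners."""
    return {
        (a, b): (xor_cutoff(a, b, m), xor_polynomial(a, b, m), xor_from_eq5(a, b, m))
        for a, b in itertools.product((0.0, m), repeat=2)
    }
```

For $m = 1$ every form returns 1 ('False') at (0,0) and (1,1), and 0
('True') at (0,1) and (1,0).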

Another important relation is IF $C_1$ THEN $C_2$ = NOT $C_1$ OR
$C_2$. One way to extend this to real values, written in the
similarity interpretation where $m$ stands for 'True', is

$$C_{IF} = c_m((m - C_1) + C_2).$$

The limit of binary logic requires function values 0 or $m$ at the
four corners where the input variables are 0 or $m$. Any function
having more than two different values at those four points cannot
correspond to a logical function. The only possible linear functions
are for example 1, $C_A$, $C_B$ and their negations. Therefore
combinations like $aC_A + (1 - a)C_B$ (LIN = linear) are more general
and do not correspond to logical combinations at the four corners.
However, setting $a = C_C$, LIN appears to be a combination of three
logical properties with the value of $C_C$ fixed (for all $f^0$).

Thus, such functions can be seen as parts (of combinations) of logical
functions with certain values of $C_i$ excluded (e.g. with the value
fixed). Specifically, in the limit $m \to \infty$ and $C_i$ finite we
obtain for the above rules for $C_A \ne \infty \ne C_B$

$$\begin{aligned}
C_{(A\ \mathrm{AND}\ B)} &= C_A + C_B, \\
C_{(A\ \mathrm{OR}\ B)} &= 0, \\
C_{(\mathrm{NOT}\ A)} &= \infty.
\end{aligned} \qquad (6)$$



Figure 6: Different realizations of XOR. Panels:
$c(A+1-B)\,c(1-A+B)$; $1 - c(A+B)\,c(2-A-B)$; $(1-B+AB)(1-A+BA)$;
$2AB-A-B+1$.



These are only linear functions. To allow the value $\infty$ for the
$C_i$ one has to extend the definitions of AND, OR and NOT according
to

$$C_{(A\ \mathrm{AND}\ B)} = \infty \quad \text{for } C_A = \infty \text{ or } C_B = \infty,$$

$$C_{(A\ \mathrm{OR}\ B)} = \begin{cases}
C_A & \text{for } C_B = \infty \ne C_A, \\
C_B & \text{for } C_A = \infty \ne C_B, \\
\infty & \text{for } C_A = C_B = \infty,
\end{cases}$$

$$C_{(\mathrm{NOT}\ A)} = 0 \quad \text{for } C_A = \infty,$$

and one finds that the functions on the whole interval $[0, \infty]$
are nonlinear. This representation is interesting in so far as the AND
is linear and the nonlinearity of the OR can be implemented by just
skipping functions with final $C(f^0) = \infty$ from $F^0$. We will
call this a hard implementation of the OR and other implementations
soft. Then for all functions $f^0$ under consideration, i.e. in $F^0$,
$C(f^0)$ is a linear function of its constituents $C_i(f^0)$. But note
that, unlike for finite $m$, here $C_{(\mathrm{NOT\ NOT}\ A)}$ is not
equal to $C_A$ for $0 \ne C_A \ne \infty$.

LIN, in contrast to AND, is strictly monotonic in all $C_i$ even if
one $C_{i^*} = m$. This also allows, if $C_{i^*} = m$ cannot be
changed (momentarily), the use of gradient information to improve the
other $C_i$. To include information related to OR one can combine LIN
with a multiplicative implementation of OR:

$$\begin{aligned}
C_{(A\ \mathrm{LIN}\ B)} &= aC_A + (1 - a)C_B, \quad 0 < a < 1, \\
C_{(A\ \mathrm{OR}\ B)} &= C_A C_B/m.
\end{aligned} \qquad (7)$$

5.3 Special Properties: probabilities, log-probabilities, distances, and averages

We briefly discuss some special properties: probabilities (or
partition sums, if not normalized), log-probabilities (related to free
energies), distances (related to scalar products) and averages
(expectations, related to energies).



5.3.1 Probabilities 

For probabilities (not densities) the logical combinations of events
are defined by

$$\begin{aligned}
p(A\ \mathrm{AND}\ B) &= p(A)\,p(B|A), \\
p(A\ \mathrm{OR}\ B) &= p(A) + p(B) - p(A)\,p(B|A), \\
p(\mathrm{NOT}\ A) &= 1 - p(A).
\end{aligned}$$

(We can understand corresponding expressions for densities formally to
be defined by
$p(F(A,B)) = \int_{F(A,B)} p(x)\,dx = \int_A \int_B p_{F(A,B)}(x_A, x_B)\,dx_A\,dx_B$,
and interpret the rules as
$p(x_A\ \mathrm{AND}\ x_B)\,dx_A\,dx_B = p_A(x_A)\,p_B(x_B|x_A)\,dx_A\,dx_B$,
so the integral factorizes for independent components $x_A$, $x_B$.
With a function $\delta$ fulfilling
$\int_A \delta(A)\,dx_A = 1 = \int_B \delta(B)\,dx_B$, we have
$p(x_A\ \mathrm{OR}\ x_B)\,dx_A\,dx_B = (p_A(x_A)\delta(B) + p_B(x_B)\delta(A) - p_{AB}(x_A, x_B))\,dx_A\,dx_B$,
i.e. $(p_A(x) + p_B(x))\,dx$ for disjunct events $x_A$, $x_B$, and
$p(\mathrm{NOT}\ x_A)\,dx_A = (\delta(A) - p_A(x_A))\,dx_A$. So
integration of densities to obtain probabilities
$p(A) = \int_A p(x)\,dx$ has the form of an OR for disjunct events.)
With respect to the two (similarity) properties $p(A)$ and $p(B)$ the
probabilistic AND, i.e. $p(A)p(B)(p(B|A)/p(B))$, is not one function
but a whole family, parameterized by $p(B|A)$. All members of this
family coincide at the four binary corners, and the quantitative form
of a probabilistic AND changes with the dependency of A on B.

To take advantage of independence in the case of OR one can use
DeMorgan's law and write, for a set of independent $A_i$,
$p(\mathrm{OR}_i A_i) = 1 - \prod_i (1 - p(A_i))$, where
$p(\mathrm{AND}_i(\mathrm{NOT}\ A_i))$ factorizes. This form is also
known as 'noisy OR' (Pearl, 1988, Jensen, 1996), and one may set one
$A_i$ constantly equal to one (or zero) to have a nonzero baseline if
all other $A_i$ are zero (or one).
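
The noisy OR formula can be checked against a brute-force sum over all
joint outcomes of independent events. The following Python sketch is
only an illustration; the function names are hypothetical and the
events are assumed to be independent, as in the text.

```python
import itertools


def noisy_or(probs):
    """p(OR_i A_i) = 1 - prod_i (1 - p(A_i)) for independent events."""
    none_happens = 1.0
    for p in probs:
        none_happens *= 1.0 - p
    return 1.0 - none_happens


def or_by_enumeration(probs):
    """Sum the probabilities of all joint outcomes in which at least one event occurs."""
    total = 0.0
    for outcome in itertools.product((False, True), repeat=len(probs)):
        if any(outcome):
            weight = 1.0
            for happened, p in zip(outcome, probs):
                weight *= p if happened else 1.0 - p
            total += weight
    return total
```

The product form needs a number of operations linear in the number of
events, whereas the enumeration grows exponentially; both give the
same value.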

5.3.2 Log-probabilities

Assume $0 \le p \le 1$ is a probability (or, for AND and OR, a bounded
density $0 \le p \le c$); then the log-probability $L = \ln p$ lies in
the interval $[0, -\infty]$ (or $[\ln c, -\infty]$, respectively) and
the rules for probabilities become

$$\begin{aligned}
L(A\ \mathrm{AND}\ B) &= L(A) + L(B|A), \\
L(A\ \mathrm{OR}\ B) &= \ln\left(e^{L(A)} + e^{L(B)} - e^{L(A\ \mathrm{AND}\ B)}\right), \\
L(\mathrm{NOT}\ A) &= \ln\left(1 - e^{L(A)}\right).
\end{aligned} \qquad (8)$$

Especially interesting is the OR for disjunct events, where it is
equivalent to an XOR. Fig. 6 (where for probabilities the roles of one
and zero have to be exchanged) shows that XOR is prototypical for a
situation with two clearly separated degenerate maxima. Under slight
perturbations the two degenerate global maxima both remain local
maxima. It needs a 'considerable' deformation of the probability
surface or, equivalently, an XOR probability surface that is
relatively flat in comparison to other influences, to let one of the
two local maxima disappear. At this point the solution to the problem
of finding the most probable state shows a bifurcation. We will see
below that this is related to phase transitions, and may remark at
this point that the term 'temperature' is related to the relative
flatness of the XOR.



In contrast, OR for independent events, shown for example in Fig. 4,
has a continuous line of degenerate maxima, and already a small
perturbation can lead to a unique maximum. Indeed, any OR can be
expressed as an OR for disjunct events. Choosing for example
$\omega_1$ = A AND B, $\omega_2$ = A AND NOT B, $\omega_3$ = NOT A AND
B, $\omega_4$ = NOT A AND NOT B, we have A OR B = $\omega_1$ XOR
$\omega_2$ XOR $\omega_3$, leading to the degenerate maxima.

Therefore, we look in a bit more detail at the OR for disjunct events.
Expanding the log-probability for disjunct $A$, $B$ around some $L_0$
we obtain to second order in $\Delta_A = L(A) - L_0$,
$\Delta_B = L(B) - L_0$

$$\begin{aligned}
L(A\ \mathrm{OR}\ B) &\approx L_0 + \ln 2 + \frac{1}{2}(\Delta_A + \Delta_B) + \frac{1}{8}(\Delta_A - \Delta_B)^2 \\
&= \ln 2 + \frac{L(A) + L(B)}{2} + \frac{1}{8}\left(L(A) - L(B)\right)^2 \\
&= \ln 2 + \frac{1}{2}\ln p(A) + \frac{1}{2}\ln p(B) + \frac{1}{8}\left(L(A)^2 - 2L(A)L(B) + L(B)^2\right),
\end{aligned}$$

independent of $L_0$, and exact if $L(A) = L(B)$. The last line shows
that the first order terms give the annealed approximation, where a
nonlinear function of the average is replaced by the average of the
nonlinear function,
$g(\langle x \rangle) = g(\sum_i p_i x_i) \approx \sum_i p_i g(x_i) = \langle g(x) \rangle$,
with $\sum_i p_i = 1$ (see for example Seung, 1995). For convex
functions the inequality $g(\sum_i p_i x_i) \le \sum_i p_i g(x_i)$
holds for convex combinations, i.e. $\sum_i p_i = 1$, with equality if
all $x_i$ are equal (Jensen's inequality). For the concave logarithm,
concave $f$ meaning $-f$ is convex, this reads, for example,
$\ln(\sum_i p_i x_i) \ge \sum_i p_i \ln(x_i)$.
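
The quality of the second order expansion of $L(A\ \mathrm{OR}\ B)$
for two disjunct events can be illustrated numerically. The sketch
below is written in Python with hypothetical function names. It
compares the exact value $\ln(e^{L(A)} + e^{L(B)})$ with the first
order (annealed) and second order approximations, under the assumption
that $L(A)$ and $L(B)$ are not too far apart.

```python
import math


def log_or_exact(l_a, l_b):
    """Exact log-probability of A OR B for disjunct events."""
    return math.log(math.exp(l_a) + math.exp(l_b))


def log_or_annealed(l_a, l_b):
    """First order (annealed) approximation: ln 2 + (L(A) + L(B)) / 2."""
    return math.log(2.0) + 0.5 * (l_a + l_b)


def log_or_second_order(l_a, l_b):
    """Second order approximation: annealed value plus (L(A) - L(B))^2 / 8."""
    return log_or_annealed(l_a, l_b) + (l_a - l_b) ** 2 / 8.0
```

For $L(A) = L(B)$ all three expressions agree, and by Jensen's
inequality the annealed value never exceeds the exact one.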

We may also consider a weighted OR for disjunct events $A_i$,

$$L(\mathrm{OR}^a_i A_i) = \ln P(\mathrm{OR}^a_i A_i) = \ln \sum_i^n a_i e^{L_i},$$

with $L_i = L(A_i)$. Disjunct $A_i$ may or may not be elementary
events $\omega_i \in \Omega$ of the model under study, but they can
always be seen as (effective) elementary events for $\bigcup_i A_i$
with respect to a specific OR. A weighted OR corresponds to an
unweighted OR in an enlarged event space, as can be seen by writing

$$a_i e^{L_i} = e^{L_i + L_i^a}, \qquad L_i^a = \ln a_i.$$

This may be interpreted as a situation where the $A_i$ have an
additional independent dimension $a$ with values $i$, i.e. the labels
of the mixture components, so the complete event $A_i$ has
log-probability $L(A_i) + L^a(i)$ corresponding to an AND for
independent events with $p_i = a_i/Z_a$ and $Z_a = \sum_i^n a_i$.
Thus, $L_i = L(A_i)$ can be seen as a special random variable on the
events $i$.

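In particular, a (weighted) OR has the structure of a cumulant
generating function with respect to the normalized weighting factors
$a_i/Z_a$, representing probabilities. (See also Section 5.3.4.) To
see this, we look at the Taylor expansion of $P$ around $L_0$,

$$P(\mathrm{OR}^a_i A_i) = \sum_i a_i e^{L_i} = \sum_i a_i e^{\Delta_i} e^{L_0}
= e^{L_0} \sum_k \frac{1}{k!} \sum_i a_i \Delta_i^k,$$

which contains in the expansion coefficients all $k$th moments of the
differences $\Delta_i$ (or of the $L_i$ itself for $L_0 = 0$) with
respect to the $A_i$ and $p_i$,

$$M^a_k(\Delta) = \langle \Delta^k \rangle_a = \frac{1}{Z_a} \sum_i^n a_i \Delta_i^k
= \frac{d^k}{d\beta^k} \frac{1}{Z_a} \sum_i^n a_i e^{\beta \Delta_i} \Big|_{\beta=0}
= \frac{d^k}{d\beta^k} \langle e^{\beta \Delta} \rangle_a \Big|_{\beta=0},$$

with $M^a_0 = 1$ due to $\Delta^0 = 1$.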

We can also introduce a common scaling factor $\beta$, often called
inverse temperature, for all $L_i$. Then because

$$\frac{d f(\beta L_i)}{d\beta} \Big|_{\beta=0}
= L_i \frac{d f(\beta L_i)}{d(\beta L_i)} \Big|_{\beta=0}
= L_i \frac{d f(L_i)}{d L_i} \Big|_{L_i=0,\ \beta=1}$$

both derivatives generate the coefficients of the Taylor expansion.
Indeed, one sees directly

$$M^a_k(\Delta) = \frac{d^k}{d\beta^k} \langle e^{\beta \Delta} \rangle_a \Big|_{\beta=0}.$$

Thus, while the moments $\langle \Delta^k \rangle_a$ are the Taylor
coefficients of the moment generating function
$\langle e^{\beta\Delta} \rangle_a$ for $\beta = 1$, they can also be
calculated as its derivatives at $\beta = 0$. Thus, they are, up to
the factor $Z_a e^{L_0}/k!$, the Taylor coefficients of a high
temperature expansion around $\beta = 0$. Notice that we expanded
every $L_i$ around the same $L_0$ so $e^{L_0}$ could be factored out.
In general we could use different origins $L_0^i$ for different
$L_i$. Indeed, if the $e^{\beta L_i}$ become relatively more separated
(large $\beta$ or low temperature case) a common origin for the Taylor
expansion becomes a less good choice. (For finite systems, i.e. a
finite $\sum_i$, the sum of exponentials is an analytic function so
the convergence radius of the corresponding power series is infinite.
In contrast, for example for the logarithm function (see below), the
convergence radius of a Taylor expansion is limited.) These problems
are typical for phase transitions (see Section 10), where the
log-probability $L(\mathrm{OR})$ changes its number of maxima, and the
behavior of $L(\mathrm{OR})$ at $L_i = L_0, \forall i$ is drastically
different (being e.g. a minimum) from that of the components at
$L_i = L_0$ (where it may be a maximum).
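Analogously, we can expand $L = \ln P$ in the parameter $\beta$ around
$\beta = 0$, and obtain the high temperature expansion of
$L(\mathrm{OR})$,

$$L(\mathrm{OR}^a_i \beta A_i) = \ln P(\mathrm{OR}^a_i \beta A_i)
= \ln Z_a + \beta L_0 + \sum_{k=0}^{\infty} \frac{\beta^k}{k!} C^a_k(\Delta), \qquad (10)$$

where the coefficients

$$C^a_k(\Delta) = \frac{d^k}{d\beta^k} \ln\left(\frac{1}{Z_a} \sum_i a_i e^{\beta \Delta_i}\right) \Big|_{\beta=0}$$

are called cumulants. For example one finds $C^a_0 = 0$,
$C^a_1 = M^a_1$, $C^a_2 = M^a_2 - (M^a_1)^2$,
$C^a_3 = M^a_3 - 3M^a_2 M^a_1 + 2(M^a_1)^3$.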
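As the origin of expansion $L_0$ enters through the linear term only,
all cumulants of order $k \ge 2$ are $L_0$ independent (i.e.
independent of the mean), and the first cumulant, i.e. the mean,
compensates for the constant because
$C^a_1(\Delta) = \frac{1}{Z_a} \sum_i a_i \Delta_i = C^a_1(L) - L_0$.
Thus, for all $L_0$ the expansion looks like the one for $L_0 = 0$,

$$L(\mathrm{OR}^a_i \beta A_i) = \ln Z_a + \sum_{k=0}^{\infty} \frac{\beta^k}{k!} C^a_k(L).$$

For example, up to second order this gives
($a_i = 1 \Rightarrow Z_a = n$)

$$L(\mathrm{OR}^a_i A_i) = \ln P(\mathrm{OR}^a_i A_i) = \ln \sum_i a_i e^{L_i}
\approx \ln Z_a + \langle L \rangle_a + \frac{1}{2}\left(\langle L^2 \rangle_a - \langle L \rangle_a^2\right). \qquad (11)$$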
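
The second order cumulant expansion of Eq. (11) can be compared with
the exact weighted OR. The Python sketch below uses hypothetical
function names and assumes $\beta = 1$ and log-probabilities $L_i$
that are close to each other, so that higher cumulants are small.

```python
import math


def weighted_log_or(log_probs, weights):
    """Exact weighted OR for disjunct events: ln sum_i a_i exp(L_i)."""
    return math.log(sum(a * math.exp(l) for a, l in zip(weights, log_probs)))


def second_order_cumulant_expansion(log_probs, weights):
    """ln Z_a + <L>_a + (<L^2>_a - <L>_a^2) / 2, as in Eq. (11) with beta = 1."""
    z_a = sum(weights)
    mean = sum(a * l for a, l in zip(weights, log_probs)) / z_a
    second_moment = sum(a * l * l for a, l in zip(weights, log_probs)) / z_a
    return math.log(z_a) + mean + 0.5 * (second_moment - mean ** 2)
```

Shifting all $L_i$ by a constant shifts the approximation by the same
constant, reflecting that only the first cumulant depends on the
origin $L_0$.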



It may be interesting to note that, for example, the mean (first
moment and first cumulant) for our special random variable
$L = \ln p$ is just the negative average information (entropy)
$\langle L \rangle_a = \langle \ln p \rangle_a = -I_a(p)$ with respect
to the $p_i = a_i/Z_a$. The negative log-likelihood $-L$ is also
sometimes called bit-number, with the corresponding bit-cumulants with
respect to the $a_i/Z_a$ generated by
$\ln \langle e^{-\beta L} \rangle_a$, related to the Renyi information
$I_\beta$ by
$\ln \langle e^{-(1-\beta)L} \rangle_a = (\beta - 1) I_\beta$ (Beck &
Schlögl, 1993).

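We can also split $L_i$ for all mixture components $i$ into several
parts, $L_i = \sum_j L_{ij}$ (which means ANDing), and define a
corresponding set of $\beta_j$ so that we have

$$L(\mathrm{OR}^a_i(\mathrm{AND}_j A_{ij})) = \ln \sum_i a_i e^{\sum_j \beta_j L_{ij}}.$$

Then the mixed derivatives of $L$ give the multidimensional cumulants
of the $L_{ij}$ or $\Delta_{ij} = L_{ij} - L_{0j}$ according to

$$C^a_{k_1, k_2, \cdots}(\Delta) = \frac{\partial^{k_1}}{\partial \beta_1^{k_1}}
\frac{\partial^{k_2}}{\partial \beta_2^{k_2}} \cdots
\ln\left(\frac{1}{Z_a} \sum_i a_i e^{\sum_j \beta_j \Delta_{ij}}\right) \Big|_{\beta=0}.$$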

The terms $e^{\beta L_i}$ become for large $\beta$ (low temperature)
the smaller the smaller $L_i$. In the limit $\beta \to \infty$ only
events with maximal probability $p(A_{i^*}) = \max_i p(A_i)$ ('ground
states') survive. Therefore, skipping from the sum
$\sum_i P(A_i) = \sum_i e^{L_i}$ for disjunct $A_i$ the smaller terms
$P(A_i) \approx 0$ (low probability events) and keeping only the
larger ones ($p(A_i) \gg 0$) (high probability events) is also called
low temperature expansion. For continuous variables $i$, where the sum
is replaced by an integral, this approximation appears as saddle point
approximation (see Section 7.2) and its higher order variants.
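
The low temperature limit can be made concrete with a few lines of
Python. In this sketch (function names are illustrative) the scaled
quantity $\frac{1}{\beta} \ln \sum_i e^{\beta L_i}$ is evaluated in a
numerically stable way, assuming disjunct events with unit weights.
For large $\beta$ it approaches the largest $L_i$, i.e. only the
'ground state' contributes.

```python
import math


def scaled_log_or(log_probs, beta):
    """(1 / beta) * ln sum_i exp(beta * L_i), with the largest term factored out."""
    top = max(log_probs)
    rest = sum(math.exp(beta * (l - top)) for l in log_probs)
    return top + math.log(rest) / beta


def ground_state_value(log_probs):
    """Lowest order low temperature approximation: keep only the largest term."""
    return max(log_probs)
```

For $n$ degenerate maxima the correction to the ground state value is
$\ln(n)/\beta$, which vanishes as $\beta \to \infty$.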

We also take a short look at an OR for independent events. To make use
of the factorization of probabilities for independent events we apply
DeMorgan's law to write

$$\begin{aligned}
L(\mathrm{OR}_i A_i) &= \ln\Big(1 - \prod_i (1 - e^{L_i})\Big), \\
L(\mathrm{NOT}(\mathrm{OR}_i A_i)) &= \ln\Big(\prod_i (1 - e^{L_i})\Big) = \sum_i \ln(1 - e^{L_i}),
\end{aligned} \qquad (12)$$

for discrete events, corresponding to the noisy OR for probabilities.
For dependent $A_i$ the conditioned factors have to be used. Jaakkola
& Jordan (1996) give an expansion of the function

$$\ln(1 - e^{L}) = \sum_{k=0}^{\infty} \ln g(-2^k L),$$

in terms of the logistic function $g(z) = 1/(1 + e^{-z})$ and use it
for efficient, approximate calculations in graphical models.
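
The expansion in terms of logistic functions follows from the product
identity $\prod_{k \ge 0} (1 + y^{2^k}) = 1/(1 - y)$ for $|y| < 1$
with $y = e^{L}$. The Python sketch below checks it numerically for
$L < 0$; the function names and the default truncation order are
illustrative assumptions, not taken from the cited work.

```python
import math


def logistic(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_one_minus_exp(log_p):
    """Direct evaluation of ln(1 - exp(L)) for L < 0."""
    return math.log1p(-math.exp(log_p))


def log_one_minus_exp_series(log_p, terms=40):
    """Truncated expansion sum_{k < terms} ln g(-2^k L) for L < 0."""
    return sum(math.log(logistic(-(2 ** k) * log_p)) for k in range(terms))
```

Truncating after $K$ terms leaves the remainder
$-\ln(1 - e^{2^K L})$, so the series converges very quickly once
$2^K |L|$ exceeds a few units.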

Partitioning the events $A_i$ into non-overlapping subsets, i.e. into
disjunct (effective) elementary events $\omega_j$, we have
$p(A_i) = p(\mathrm{OR}_{\omega_j \in A_i}\, \omega_j) = \sum_{\omega_j \in A_i} p(\omega_j)$.
Then we see that expressing the OR for non-disjunct $A_i$ by disjunct
$\omega_j$ gives

$$L(\mathrm{OR}_i A_i) = L(\mathrm{OR}^N_j \omega_j) = \ln\Big(\sum_j N_j e^{L(\omega_j)}\Big),$$

which reproduces for the smallest of those partitions the product
terms $p(\omega_j) = e^{L_{i_1}} e^{L_{i_2}} e^{L_{i_3}} \cdots$ in
Eq. (12), reweighted with the number $N_j$ of $A_i$ which contain
$\omega_j$, and without the intermediate terms with oscillating sign.
One can now, for example, apply the high temperature expansion with
$Z_a = Z_N = \sum_j N_j$.

5.3.3 Distances 

Monotonic functions of negative distances $\|y_q(f^0) - T_q\|^2$ are
often used for log-probabilities. The definition of $q$ can always be
changed in such a way that $T_q = 0$. The most common example are data
terms $d^2 = \sum_i (y_i^p - y_{q_i}(f^0))^2$, where the $y_i^p$
represent the template vector and $y_{q_i}$ the deterministic answer
$y_{q_i}(f^0)$.

In spaces where the distances fulfill

$$\|g + h\|^2 + \|g - h\|^2 = 2\|g\|^2 + 2\|h\|^2$$

there exists a scalar product, written as
$\langle \cdot | \cdot \rangle$, related to the distance by

$$\|h\|^2 = \langle h | h \rangle.$$

Different positive definite kernels $O$ define different scalar
products

$$\langle g | h \rangle_O = \langle g | O | h \rangle = \int dx\, dx'\, g(x)\, O(x, x')\, h(x').$$

Minimizing squared distances

$$\|y_q(f^0) - T_q\|^2 = \langle y_q(f^0) | y_q(f^0) \rangle - 2 \langle y_q(f^0) | T_q \rangle + c$$

(written for real
$\langle y_q(f^0) | T_q \rangle = \langle y_q(f^0) | T_q \rangle^* = \langle T_q | y_q(f^0) \rangle$,
with $c$ an $f^0$-independent constant and $*$ indicating complex
conjugation) is equivalent to maximizing scalar products (overlaps),
like $\sum_i T_{q_i} y_{q_i}$, for normalized $y_q$ and $T_q$. (But
normalization is a maximally nonlocal condition.) For overlaps one has
to use rules with zero representing false and one representing true,
so the definitions of AND and OR are exchanged.

For example, properties which have positive values bounded by $m$ can
be obtained from log-probabilities or distances by
$C_A = g^{-1}(d^2(A)) = m(1 - e^{-d^2(A)}) = m(1 - e^{L(A)}) = m(1 - p(A))$
with inverse
$d^2 = -L = g(C) = -\ln(1 - C/m) = \ln(m/(m - C))$. This gives for
independent A and B the equations (5) for properties. Thus, using
$g(C) = -\ln(1 - C/m)$ directly relates properties to probabilities
according to $C_i = m(1 - p_i)$.
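
That the mapping $C = m(1 - p)$ turns the probability rules for
independent events into the property rules of Eq. (5) can be verified
directly. The following Python sketch uses illustrative function names
and assumes independent A and B.

```python
def property_from_probability(p, m):
    """Map a probability to a bounded deviation property, C = m (1 - p)."""
    return m * (1.0 - p)


def probability_from_property(c, m):
    """Inverse mapping, p = 1 - C / m."""
    return 1.0 - c / m


def and_property(c_a, c_b, m):
    """AND of Eq. (5): C_A + C_B - C_A C_B / m."""
    return c_a + c_b - c_a * c_b / m


def or_property(c_a, c_b, m):
    """OR of Eq. (5): C_A C_B / m."""
    return c_a * c_b / m
```

With $p(A) = 0.3$, $p(B) = 0.6$ and $m = 2$, for example, the property
of A AND B computed from $p(A)p(B) = 0.18$ is 1.64, the same value
that Eq. (5) gives for $C_A = 1.4$ and $C_B = 0.8$.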

Writing OR in terms of Euclidean square distances by choosing
$L_i = -d_i^2/(2\sigma_i^2) + \ln c_i$, with some constants
$\sigma_i$ and $c_i$, one obtains for disjunct events a Gaussian
mixture model

$$P(f^0) \propto \sum_i c_i\, e^{-d_i^2(f^0)/(2\sigma_i^2)}.$$

For non-disjunct events product terms like

$$\prod_i e^{-d_i^2/(2\sigma_i^2)}$$

must be subtracted (or approximately a cut-off function can be
included).
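
A minimal Python sketch of this construction for a one-dimensional
$f^0$ follows. The templates, widths and weights are made-up
illustrative values, and the events are assumed to be disjunct so that
the OR is a plain sum. It builds the log-probabilities $L_i$ from
squared distances to templates and combines them by OR.

```python
import math


def log_terms(f, templates, sigmas, weights):
    """L_i = -d_i^2 / (2 sigma_i^2) + ln c_i with d_i^2 = (f - T_i)^2."""
    return [
        -((f - t) ** 2) / (2.0 * s * s) + math.log(c)
        for t, s, c in zip(templates, sigmas, weights)
    ]


def log_or_disjunct(log_probs):
    """OR for disjunct events, ln sum_i exp(L_i), evaluated stably."""
    top = max(log_probs)
    return top + math.log(sum(math.exp(l - top) for l in log_probs))


def unnormalized_mixture(f, templates, sigmas, weights):
    """Direct evaluation of sum_i c_i exp(-d_i^2 / (2 sigma_i^2))."""
    return sum(
        c * math.exp(-((f - t) ** 2) / (2.0 * s * s))
        for t, s, c in zip(templates, sigmas, weights)
    )
```

Exponentiating the OR of the $L_i$ reproduces the unnormalized
Gaussian mixture, which has one local maximum near each sufficiently
separated template.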

5.3.4 Averages 
Unnormalized probabilities 

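Averages or expectations can be related to unnormalized probabilities
$Z(A, \beta)$ (with $\beta$ a fixed parameter we will discuss below),
also called partition sums. Thus,

$$\begin{aligned}
p(A, \beta) &= \frac{Z(A, \beta)}{Z(\beta)}, \\
Z(A, \beta) &= \sum_{\omega \in A} Z(\omega, \beta), \\
Z(\beta) = Z(\Omega, \beta) &= \sum_{\omega \in \Omega} Z(\omega, \beta),
\end{aligned} \qquad (13)$$

for a complete set of disjunct (elementary) events $\omega$, and
$p(\Omega) = 1$. If we define $Z(A|B)$ by $p(A|B) = Z(A|B)/Z$,
partition sums transform under logical operations like probabilities
with an additional normalization factor $Z$, analogously to $m$ in the
rules (5) (with AND and OR exchanged). If we choose
$Z(A|B) = Z(A, B)/Z(B) = p(A|B)$ (which one can do for averages, see
below) then $Z$ appears only in the NOT. Defining shifted
log-probabilities $-\beta F$ by

$$Z(\beta) = e^{-\beta F(\beta)}, \qquad Z(A, \beta) = e^{-\beta F(A, \beta)},$$

the $F$, also called free energies,

$$F(\beta) = -\frac{1}{\beta} \ln Z(\beta), \qquad F(A, \beta) = -\frac{1}{\beta} \ln Z(A, \beta),$$

transform according to

$$\begin{aligned}
F(A\ \mathrm{AND}\ B) &= F(A) + F(B|A) - F, \\
F(A\ \mathrm{OR}\ B) &= -\frac{1}{\beta} \ln\left(e^{-\beta F(A)} + e^{-\beta F(B)} - e^{-\beta F(A\ \mathrm{AND}\ B)}\right), \\
F(\mathrm{NOT}\ A) &= -\frac{1}{\beta} \ln\left(e^{-\beta F} - e^{-\beta F(A)}\right).
\end{aligned} \qquad (14)$$

For the sake of simplicity the $\beta$-dependence of $F$ is not
written explicitly here.

We choose as family $A$ a set of disjunct events
$\omega \in \Omega$. For the corresponding $\omega$ we define the free
energies $F(\omega, \beta) = E(\omega)$ to be $\beta$-independent. We
will call them energy of $\omega$ and can then write

$$p(\omega, \beta) = \frac{Z(\omega, \beta)}{Z(\beta)} = \frac{e^{-\beta E(\omega)}}{Z(\beta)} = e^{-\beta (E(\omega) - F(\beta))}. \qquad (15)$$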

29 Clearly, the OR looks quite complicated, even for disjunct events,
and indeed, it is the source of many difficulties. The summation
corresponds for example to the calculation of partition sums in
statistical physics. Another summation outside the logarithm is added
for disordered systems, like for spin glasses, where the (shifted)
log-likelihood (or energy) function governing the system is also in
reality not exactly known. In principle this corresponds to adding
additional components (AND) to the elementary events $\omega$ with
possible realizations $i$. Combining different possible realizations
$L^i$ by OR corresponds to $\ln \sum_i p_i e^{L^i}$ with $p_i$ a
probability distribution over realizations of the energy function.
Alternatively, one can restrict to averages (weighted AND) of
observables (and therefore the partition sum) over realizations $i$.
That means one integrates over part of the components of $\omega$ and
considers $\sum_i p_i \ln e^{L^i} = \sum_i p_i L^i$. Including the
thermal OR over complete sets of disjunct events (states $L_{j_i}$) in
the average (over different 'replicas' $i$ of a system with
differently 'quenched' interactions) yields
$\sum_i^n \ln(\sum_j^m e^{L^i_j}) = \ln \prod_i^n \sum_j^m e^{L^i_j} = \ln \sum_{j_1}^m \cdots \sum_{j_n}^m \prod_i^n e^{L^i_{j_i}}$,
where $i$ may be called replica index. The product of sums creates all
kinds of product terms for different systems $j_i$ and can be huge or
infinite. The replica approach uses the identity
$\ln Z = \lim_{n \to 0} (Z^n - 1)/n$ to substitute this large product
by a product with the number of factors going to zero, i.e.
$n \to 0$ (not 1!). In the corresponding asymptotic mean field
approximation, correlations between different replicas of the systems
also enter the theory. However, it requires more assumptions and
calculation tricks than a standard saddle point approximation. For
example, analytical results for an integer $n$ have to be analytically
continued to real $n$ to obtain the limit $n \to 0$ (Mezard, Parisi, &
Virasoro, 1987). (Special observables are those which do not fluctuate
with $i$, like, for large systems, variables with fluctuations
vanishing fast enough with the system size. The value of such an
observable $x_n^*$ is then the same for every individual system of the
ensemble. Restricted to only non-fluctuating observables $x_n$, i.e.
$L^i(x) = L^i(x_n^*)$ and therefore $E(\omega) = E(x_n^*)$, the
probability distribution $p_i$ becomes deterministic and the $\sum_i$
disappears. This is the reason one is especially interested in the
(thermodynamic) limit of infinite system size, where for some
(self-averaging) observables the fluctuations around the average can
disappear, e.g. for uncorrelated variables the Gaussian limit theorem
applies.)



We see that the log-probability $L = -\beta(E(\omega) - F(\beta))$ is
hereby written as the (negative, $\beta$-scaled) difference of a
$\beta$-independent energy which describes the system, i.e. the
variation of the probability between the $\omega$, and an
$\omega$-independent 'scaling shift' containing the $\beta$
(temperature) dependence.
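
The relation between energies, the free energy and the probabilities
of Eq. (15) can be sketched in a few lines of Python. The energy
values in the usage note are arbitrary illustrative numbers, and the
function names are not from the text.

```python
import math


def free_energy(energies, beta):
    """F(beta) = -(1 / beta) ln Z(beta) with Z(beta) = sum_w exp(-beta E(w))."""
    z = sum(math.exp(-beta * e) for e in energies)
    return -math.log(z) / beta


def probabilities(energies, beta):
    """p(w, beta) = exp(-beta (E(w) - F(beta))), as in Eq. (15)."""
    f = free_energy(energies, beta)
    return [math.exp(-beta * (e - f)) for e in energies]
```

For energies (0, 1, 2), a small $\beta$ (high temperature) gives a
nearly uniform distribution, while a large $\beta$ (low temperature)
concentrates the probability on the state with minimal energy.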

Sometimes one may also wish to consider effective energies
$E_{eff}(\omega, \beta)$ which are $\beta$-dependent and might be free
energies with respect to a finer set of elementary events. They are,
however, mostly used in the range where they are approximately
independent of $\beta$ (temperature), and we will, if not stated
explicitly otherwise, always assume that energies are
$\beta$-independent. We can also take the random variable $E$ to be
dependent on other random variables $E_j(\omega)$ so that
$\beta F(\omega, \beta) = g(\{E_j(\omega)\})$. Useful will be a linear
combination, for which we write (see footnote 31)

$$\beta E(\omega) = \sum_j \beta_j E_j(\omega),$$

with $E_j$ independent of $\beta_j$ and $\beta$. For the sake of
simplicity we will understand $F(A, \beta)$ to mean
$F(A, \beta, \{\beta_j\})$ in that case, and analogously for other
variables like $Z$ and $p$.

Averages and generating functions 

Partition sums can be seen as averages under a uniform probability distribution $p(\omega)$. In general, averages or expectations of a function $h(\omega)$ over a family $A$ of disjunct events $\omega \in A$ are defined as

$$h(A,\beta) = \langle h(\omega) \rangle_{A,\beta} = \frac{\sum_{\omega\in A} p(\omega,\beta)\, h(\omega)}{p(A,\beta)},$$

and we can include $h(A|B) = h(A,B)/h(B)$. We can extend the definition of $E(\omega)$ to $E(A,\beta)$ for all events $A \in \Omega$ by defining the energy to transform like (i.e. to be) an average.

Thus, $E$ is a random variable bounded from below, defined by the number (or vector) $E(\omega) > -\infty$ related to every elementary event $\omega \in A$. We may call $p(\omega)$ the distribution generated by $\beta E$, which can have the form $\sum_j \beta_j E_j$. On the other hand, every (vector) $\beta E$ with components bounded from below is a 'generating random variable' of some $p$, with $Z(A,\beta)$ the normalization constant of $e^{-\beta E(\omega)}$ on $A$.

We already encountered unnormalized probabilities and shifted log-probabilities in the discussion of the (weighted, but also unweighted) OR in Section 5.3.2. There the $L_i$ could be seen as energies for (effective) elementary events $A_i$, i.e. for $A_i$ they are also shifted log-probabilities with respect to the unnormalized probabilities $a_i$ corresponding to $Z(\omega, \beta = 1)$. We also introduced auxiliary variables $\beta_j$ and used a splitting of $L_i$ into components $\sum_j \beta_j L_{ij} + L_i'$ equivalent to $\sum_j \beta_j E_j(\omega)$. Therefore we can here relate energies (averages) and free energies (OR over $\Omega$) by derivatives of generating functions in the same way as we did in 5.3.2. We briefly formulate this principle again in terms of $E$ and $F$.



31 Then in physics usually only one subgroup $E_j$ is called energy, another might be called particle number.



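Now we want to calculate the expectations of the generating random variable $E_j$. For that purpose we use the required $\beta$-independence of $E$ and $E_j$ to write

$$E_j(\omega) = -\frac{\partial}{\partial \beta_j} \ln Z(\omega,\beta),$$

which includes the case of a one-component $E$. For effective, i.e. $\beta$-dependent, energies $E^{\rm eff}(\omega,\beta)$ this would give

$$E_j^{\rm eff}(\omega,\beta) + \sum_i \beta_i \frac{\partial E_i^{\rm eff}(\omega,\beta)}{\partial \beta_j} = -\frac{\partial}{\partial \beta_j} \ln Z(\omega,\beta).$$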

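Analogously, using Eqs. (13) and (15) we find the same relation for general $A$

$$E_j(A,\beta) = \langle E_j(\omega) \rangle_{A,\beta} = -\frac{\partial}{\partial \beta_j} \ln Z(A,\beta),$$

or for effective energies

$$E_j^{\rm eff}(A,\beta) + \sum_i \beta_i \frac{\partial E_i^{\rm eff}}{\partial \beta_j}(A,\beta) = -\frac{\partial}{\partial \beta_j} \ln Z(A,\beta).$$

Assuming these averages to be measurable, this relation might be used to test the range of validity, i.e. the range of $\beta$-independence, of an effective energy.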
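As a minimal numerical sketch of this relation (not part of the original derivation), the following Python fragment compares the directly computed average of one energy component with a finite-difference estimate of $-\partial \ln Z(A,\beta)/\partial\beta_1$. The four-event toy system, its energy values and all function names are assumptions chosen only for illustration.

```python
import math

# Hypothetical toy system: four elementary events with two energy components.
E1 = [0.0, 1.0, 2.0, 0.5]
E2 = [1.0, 0.0, 0.5, 2.0]


def weights(beta1, beta2):
    """Unnormalized probabilities exp(-(beta1*E1 + beta2*E2)) of all events."""
    return [math.exp(-(beta1 * a + beta2 * b)) for a, b in zip(E1, E2)]


def log_Z(beta1, beta2):
    """ln Z(A, beta) for A = set of all four events."""
    return math.log(sum(weights(beta1, beta2)))


def average(values, beta1, beta2):
    """Expectation of `values` over A under the normalized weights."""
    w = weights(beta1, beta2)
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


def minus_dlogZ_dbeta1(beta1, beta2, eps=1e-6):
    """Central finite-difference estimate of -d ln Z / d beta1."""
    return -(log_Z(beta1 + eps, beta2) - log_Z(beta1 - eps, beta2)) / (2 * eps)


direct = average(E1, 0.7, 1.3)
from_generating_function = minus_dlogZ_dbeta1(0.7, 1.3)
print(direct, from_generating_function)  # the two values should agree closely
```

For an effective, $\beta$-dependent energy the two numbers would differ by the derivative term, which is the test of $\beta$-independence mentioned above.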

Linearity of the expectation allows this to be used also to calculate averages of random variables which are linear functions of $E(\omega)$. To calculate expectations and higher order moments or cumulants of general random variables $h(\omega)$ which are either nonlinear functions of $E(\omega)$ or which are not functions $h(E(\omega))$ of $E(\omega)$ at all, e.g. if $E(\omega) = E(h(\omega)^2)$, one can extend the partition function by adding $-\lambda h(\omega)$ ('auxiliary field') to the exponent,

$$Z(\omega,\beta,\lambda) = e^{-\beta E(\omega) - \lambda h(\omega)},$$

so that

$$h(A,\beta) = -\frac{\partial}{\partial \lambda} \ln Z(A,\beta,\lambda)\Big|_{\lambda=0}.$$



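In general we find

$$\frac{Z(A,\beta,\lambda)}{Z(A,\beta)} = \langle e^{-\lambda h(\omega)} \rangle_{A,\beta} = \sum_{k=0}^{\infty} \frac{(-\lambda)^k}{k!} \langle h^k(\omega) \rangle_{A,\beta},$$

which therefore is the moment generating function, i.e. the $k$th moments $M_k(h,A,\beta) = \langle h^k(\omega) \rangle_{A,\beta}$ can be found by differentiation

$$M_k(h,A,\beta) = \frac{1}{Z(A,\beta)} \frac{\partial^k Z(A,\beta,\lambda)}{\partial(-\lambda)^k}\Big|_{\lambda=0}.$$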



Cumulants are defined as derivatives of the logarithm of the generating function, also called cumulant generating function,

$$C_k(h,A,\beta) = \frac{\partial^k}{\partial(-\lambda)^k} \ln Z(A,\beta,\lambda)\Big|_{\lambda=0},$$

where we skipped the $\lambda$-independent term $\ln Z(A,\beta)$ which is not relevant after differentiation. It is easy to see that cumulants have the nice property of being additive if the probability $p(\omega) = \prod_i p_i(\omega_i)$ factorizes into independent subsystems 32 (Beck & Schlögl, 1993). The second cumulant is the well known variance $\langle h^2 \rangle - \langle h \rangle^2$. For the case of a Gaussian $p(\omega,\beta)$, i.e. $E(\omega)$ quadratic in $\omega$ (multidimensional for vector $\omega$), Wick's theorem gives higher order moments, often represented by diagrams (see for example Negele & Orland, 1988, Zinn-Justin, 1989, Itzykson & Drouffe, 1989).
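The role of the auxiliary field can be checked numerically. The following Python sketch, under the assumption of a small hypothetical system with four events, differentiates $\ln Z(A,\beta,\lambda)$ by finite differences and compares the first two cumulants with the mean and variance of $h$ computed directly. All names and numbers are illustrative, not taken from the text.

```python
import math

# Hypothetical energies E(omega) and observable h(omega) on four events.
E = [0.0, 1.0, 2.0, 0.5]
H_OBS = [1.0, -1.0, 0.5, 2.0]
BETA = 0.8


def Z(lam):
    """Extended partition sum: sum over events of exp(-beta*E - lam*h)."""
    return sum(math.exp(-BETA * e - lam * h) for e, h in zip(E, H_OBS))


def moment(k):
    """k-th moment of h computed directly from the distribution at lam = 0."""
    return sum(math.exp(-BETA * e) * h ** k for e, h in zip(E, H_OBS)) / Z(0.0)


def cumulant1(eps=1e-5):
    """First derivative of ln Z with respect to (-lam) at lam = 0."""
    return -(math.log(Z(eps)) - math.log(Z(-eps))) / (2 * eps)


def cumulant2(eps=1e-4):
    """Second derivative of ln Z with respect to (-lam) at lam = 0."""
    return (math.log(Z(eps)) - 2 * math.log(Z(0.0)) + math.log(Z(-eps))) / eps ** 2


mean = moment(1)
variance = moment(2) - moment(1) ** 2
print(cumulant1(), mean)       # first cumulant versus mean
print(cumulant2(), variance)   # second cumulant versus variance
```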

Boltzmann-Gibbs distributions

Now we want to relate the shifted log-probabilities $-\beta F(\omega)$ to the expectations of the generating random variable $E_j$. In general, for $A \neq \omega$ the difference between energy and free energy (i.e. between averages and shifted log-probabilities, which coincide on the elementary events $\omega$) is

$$H(A,\beta) = \beta E(A,\beta) - \beta F(A,\beta) = -\sum_{\omega\in A} \frac{p(\omega,\beta)}{p(A,\beta)} \ln \frac{p(\omega,\beta)}{p(A,\beta)} = -\Big\langle \ln \frac{p(\omega,\beta)}{p(A,\beta)} \Big\rangle_{A,\beta}.$$

Here $H(A,\beta)$ is the average information (in nats, not bits) or entropy. The free energy can thus be expressed by the average energy and entropy as

$$F(A,\beta) = E(A,\beta) - \frac{1}{\beta} H(A,\beta).$$
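The decomposition of the free energy into average energy and entropy is easy to verify on a small example. The sketch below assumes a hypothetical one-component energy on four events and the convention $F = -\frac{1}{\beta}\ln Z$; the function names are ours.

```python
import math

E = [0.0, 1.0, 2.0, 0.5]   # hypothetical energies of four elementary events


def boltzmann(beta):
    """Return the normalized probabilities and the partition sum Z."""
    w = [math.exp(-beta * e) for e in E]
    z = sum(w)
    return [wi / z for wi in w], z


def free_energy(beta):
    """F = -(1/beta) ln Z."""
    return -math.log(boltzmann(beta)[1]) / beta


def average_energy(beta):
    """Average of E under the Boltzmann-Gibbs probabilities."""
    p, _ = boltzmann(beta)
    return sum(pi * e for pi, e in zip(p, E))


def entropy(beta):
    """H = -sum p ln p, measured in nats."""
    p, _ = boltzmann(beta)
    return -sum(pi * math.log(pi) for pi in p)


for beta in (0.5, 1.0, 2.0):
    print(beta, free_energy(beta), average_energy(beta) - entropy(beta) / beta)
```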

It is well known that the distribution generated by $\beta E$ maximizes the entropy under the constraints that $p(\omega)/p(A)$ is normalized and all expectations $E_j(A,\beta)$ are fixed. (For a given vector $\beta$ this is also formulated as the principle of minimum free energy $F$. See e.g. Balian, 1991, or in connection with path integrals Roepstorff, 1991.) Indeed, introducing the corresponding Lagrange multipliers $\beta_j$, $\alpha$, the stationarity equations read

$$\frac{\partial}{\partial p(\omega,\beta)} \Big[ H(A,\beta) - \sum_j \beta_j E_j(A,\beta) - \alpha \langle 1 \rangle \Big] = 0,$$

giving the Boltzmann-Gibbs distribution:

$$p(\omega,\beta) = \frac{e^{-\sum_j \beta_j E_j(\omega)}}{Z(\beta)},$$

with $Z(\beta) = e^{\alpha(\beta)+1}$. We recognize the general form (15). Thus we can say a probability distribution is the Boltzmann-Gibbs distribution of its generating variable(s) $\beta E$. The distribution for $E$ is



32 For example, spatial systems with a weak enough short range interaction, so that they can be seen in the large $n$ limit as a collection of independent local subsystems, have cumulants which are asymptotically additive in these subsystems. In these cases, besides the $p$-generating energy, the free energy $F \propto \ln Z$ and not $Z$ is an additive variable. As the variance of an average of $n$ independent random variables scales with $1/n$, the cumulants per subsystem then have low variance for large systems; hence the cumulants are proportional to the volume (extensive variables) and possible candidates for macroscopic observables of the system.



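given by summing over $\omega$ with energy $E$ (which could be a vector), i.e.

$$p(E) = p(\omega_E) = \frac{Z(E,\beta)}{Z(\beta)} = \int d\omega\, \delta(E(\omega) - E)\, e^{-\beta E(\omega) - \ln Z(\beta)} = n(E)\, e^{-\beta E - \ln Z(\beta)},$$

with $\omega_E = \{\omega \,|\, E(\omega) = E\}$ and $n(E)$ the energy density at $E$.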

Families of distributions generated by varying (the vector) $\beta$ within some parameter space $B$ are called exponential families, with canonical parameter vector $\beta$, canonical statistics $(-E)$ and cumulant (generating) function $\ln Z$. (See for example Barndorff-Nielsen, 1978, Amari, 1985, 1995, or Appendix D in Lauritzen, 1996, and references therein.)

The model from Section 2 defines an elementary event $\omega = (q, y, f^0)$ or, if we include the internal variables $y_q$, $z_q$ of questions $q$, $\omega' = (q, X^q, Y_{X^q}, Z^q, f^0)$, with $X^q$ the part of the basis used in $q \in Q^l$, $Y_{X^q}$ the corresponding answers for $x \in X^q$, and $Z^q$ the set of internal noise variables for $q$. We are for example (in the case of deterministic $f$ and of $p(q)$ not dependent on other variables) interested in calculating the expected risk $r = \langle l(\omega, f) \rangle_{p,\Omega}$ under the total posterior probability

$$p(\omega) = p(q)\, p(y|q,f^0)\, p(f^0|D) = e^{L(q) + L(y|q,f^0) + L(f^0|D)},$$

in terms of log-probabilities $L(\omega) = L(q) + L(y|q,f^0) + L(f^0|D)$. Thus we can interpret the Bayesian posterior distribution as a Boltzmann-Gibbs distribution arising from a maximum entropy procedure for the total log-posterior $L(\omega)$ with an average so that $\beta = -1$ and shifted so that $Z(\Omega) = 1$. It seems simpler to use the traditional Bayesian formulation than the equivalent maximum entropy formulation. However, there are cases where averages are directly measurable, with measurement error nearly zero. 33 Besides deterministic variables, self-averaging variables in large systems belong to this class. Examples are macroscopic observables in physics, like energy, which lead to the (generalized) canonical ensembles used in statistical physics. And indeed, in these cases the various implementations (i.e. models $F^0$, like microcanonical, canonical, grandcanonical ensemble) are asymptotically equivalent for large systems.

Approximation and Kullback-Leibler entropy

Consider another probability distribution $p'(\omega)$ generated by

$$E'(\omega) = -\frac{1}{\beta'} \ln p'(\omega,\beta') + F'(A,\beta') = -\frac{1}{\beta'} \big( \ln p'(\omega,\beta') + \ln Z'(A,\beta') \big),$$

with the same normalization, i.e. $Z(A,\beta) = Z'(A,\beta')$, so the expectation of $E'$ under $p(\omega)$ reads

$$E'(p,A,\beta) = \langle E' \rangle_{(p,A,\beta)} = -\frac{1}{\beta'} \langle \ln p'(\omega,\beta') \rangle_{(p,A,\beta)} + F'(A,\beta'),$$
33 Therefore, for example, Jaynes, 1996, sees the two approaches as two different methods, applied in different situations: maximum entropy to fix averages which are known without much computation, and Bayesian methods dealing with models to calculate the relevant average.



where $E'$ is averaged over $p(\beta)$ and not over the $p'(\beta')$ generated by itself. Concavity of the logarithm allows us to apply Jensen's inequality to the difference $\beta' E'(p,A,\beta) - \beta E(A,\beta) = \langle \ln p \rangle - \langle \ln p' \rangle = K(p,p')$, i.e. the Kullback-Leibler entropy, and we obtain

$$\beta' E'(p,A,\beta) \geq \beta E(A,\beta).$$

Thus, the $\beta E$ generating the averaging distribution has the minimal average under this distribution. The $\beta' E'$ represent different loss functions. We define an approximation problem to be the problem of minimizing the expectation (expected risk) over $A$ under $p$ for a family of loss functions, defined on the same set of $\omega$ and all having the same normalization $Z$ over $A$. Those loss functions can be parameterized in the form $\beta' E' = -c_1 \ln p'(\omega) - c_2$ with $p'$ a normalized probability density. We will call this a family of approximation losses. Then, for any parameterized family of $\beta' E'$ the probability related to the solution with minimal expected risk has minimal Kullback-Leibler distance to the actual averaging probability distribution.

Eq. (5.3.4) defines the true expected risk $\beta E$ as the solution of a minimization problem. This corresponds to a variational method for calculating the true expected risk (see e.g. Balian, 1991, Neal & Hinton, 1993, Dayan, Hinton, Neal, & Zemel, 1995):

1. Besides the effects of parameters not included, the difference between the true expected risk $\beta E$ and an expected risk $\beta^* E^*$, minimal in a parameterized subspace, has only contributions of second order in the parameters used for minimization.

2. The true expected risk is bounded by the approximated risk. 34
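The bound can be illustrated for the simplest family of approximation losses, $-\ln p'(\omega)$ with $c_1 = 1$ and $c_2 = 0$. The following Python sketch, using a hypothetical three-event distribution and three hypothetical candidate models, checks that the gap between the expected loss of a candidate and the expected loss of the true distribution equals the Kullback-Leibler entropy and is never negative.

```python
import math


def expected_loss(p_true, p_model):
    """Average of the approximation loss -ln p_model(omega) under p_true."""
    return -sum(p * math.log(q) for p, q in zip(p_true, p_model))


def kullback_leibler(p_true, p_model):
    """K(p, p'): average of ln p minus average of ln p', both under p_true."""
    return sum(p * math.log(p / q) for p, q in zip(p_true, p_model))


# Hypothetical averaging distribution and candidate models (all normalized).
P_TRUE = [0.5, 0.3, 0.2]
CANDIDATES = [[0.4, 0.4, 0.2], [0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]

# The candidate with minimal expected risk is the one closest to P_TRUE.
best = min(CANDIDATES, key=lambda q: expected_loss(P_TRUE, q))

for q in CANDIDATES:
    gap = expected_loss(P_TRUE, q) - expected_loss(P_TRUE, P_TRUE)
    print(q, gap, kullback_leibler(P_TRUE, q))
```

If the true distribution is not contained in the candidate family, the same minimization selects the member with the smallest Kullback-Leibler distance.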

5.4 Including human knowledge and other 
available preprocessors 

A human interface for constructing fuzzy priors consists 
of two steps: 

1. Subsymbolic level: 

i. Defining properties $C_i \in C$, i.e. functions of a parameterization of $f^0$, as correlates to internal concepts $C_i$. For example, a property can be specified as a typical variant of an eye or a typical structure of an electrocardiogram, but as well by the output of another available (e.g. approximation) algorithm. The properties $C_i$ are in some contexts also called linguistic variables (Zadeh, 1996).

ii. Definition of possible combinations of properties, i.e. mappings $C \times C \to C$ or linguistic functions, approximating internal mappings, i.e. the structure of concepts. Specifically, a linguistic 'not', 'and', 'or' can be related



34 Variational principles for minima (or maxima, respectively) have the advantage of giving such a bound, while variational principles related to saddle points (e.g. for complex functions) do not. On the other hand, for minima (or maxima) all second order corrections have the same sign, while for saddle points second order contributions with different sign can average away (see for example Lemm, 1995ab). Especially in high-dimensional spaces this may considerably improve the approximation.



to some real valued extension of the logical 
NOT, AND, OR. 

2. Symbolic level: Combining simple properties to create complex ones, i.e. applying linguistic functions (rules) to linguistic variables according to the communicated symbolic structure of the prior. For example, the property of being a face-like object can be built up from the properties of having two eyes, a nose and a mouth-like object. Probabilistic rules are a special set of linguistic functions which depend on a (communicated) dependence structure of the variables.

Consider that we want to construct a property $C$ (deterministic question) of $f^0$, describing prior knowledge about images of faces, out of subproperties $C_i$ by fuzzy operations. We can choose for example a log-prior

$$L \propto -\|C - T_C\|^2 + c.$$

Also, the subproperties $C_i$ can be deviation properties

$$C_i \propto -\|C_i' - T_i\|^2 + c_i.$$

For probabilistic questions one might wish to integrate over $y_q$ with $p(y_q|q,f^0)$,

$$C_q \propto -\int dy_q\, p(y_q|q,f^0)\, \|y_q - T_{y_q}\|^2.$$

The templates $T_C$, $T_i$, $T_y$ can be chosen as output of an available approximation $\hat{y}_q$ of $y_q$, e.g. $T_y = \hat{y}_q$. The approximation $\hat{y}_q$ may be produced by a previously trained artificial neural network, or any other statistical approximator, as well as by an expert. A construction according to fuzzy methods can be as follows: Let $p(y|q,f^0)$ be the probability that the given image $q$ is a face (i.e. $y$ = face). (That means, this is the classification and not the generation probability for faces.) Let $T_y$ be the answer of some already available face detector. Also, several $T_y$, i.e. several answers or different approximation methods, can be included at the same time, according to their known or subjectively believed dependency structure. The dependency between approximators may be arbitrary, including as special cases approximations which are independent or disjunct (only important for OR, not AND). 35 The template $T_y$ itself can be constructed according to fuzzy methods: Define ($f^0$-independent) $q$-templates $T_q$ for $q$ (not $y$-templates $T_y$!). They may correspond to typical constituents like eyes, mouth, nose, and look like $\sum_i \|q_i - T_i^{\rm eye}\|^2$ with $i$ denoting the pixel index. Define transformations of templates, including at least translation (represented by a
35 The problem of combining different approximations of $y_q$ (or more generally of arbitrary available data) to get a better approximation of $y_q$ is just a version of the usual statistical approximation problem. Thus, a great number of possible algorithms is available to deal with the problem. Some methods refer especially to the situation where the data are approximations of the same value (combination of experts/approximators/classifiers). Tree-like methods construct a local 'domain of responsibility' for each expert. (See for example the mixture of experts, Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994.)



transformation of the index $i$ of the template). Combine the constituents, their possible variations and combinations with fuzzy AND, OR, NOT (which includes more complicated rules like XOR or IF ... THEN) to obtain a final property $T_y$. This defines a fuzzy face classifier, incorporating human prior knowledge.
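The construction just described can be sketched in a few lines of Python. This is only an illustration under several assumptions which are not fixed by the text: images are one-dimensional lists of pixel values, the degree of match to a template is taken as the exponential of the negative squared distance, and the real-valued extensions of NOT, AND, OR are chosen as complement, product and probabilistic sum. The templates `EYE` and `MOUTH` and all function names are hypothetical.

```python
import math


def match(patch, template):
    """Degree in (0, 1] to which a patch resembles a template."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(patch, template)))


def fuzzy_not(a):
    """Real-valued NOT (assumed: complement)."""
    return 1.0 - a


def fuzzy_and(*degrees):
    """Real-valued AND (assumed: product)."""
    result = 1.0
    for d in degrees:
        result *= d
    return result


def fuzzy_or(*degrees):
    """Real-valued OR (assumed: probabilistic sum)."""
    none_true = 1.0
    for d in degrees:
        none_true *= 1.0 - d
    return 1.0 - none_true


def fuzzy_xor(a, b):
    """A more complicated rule built from NOT, AND, OR."""
    return fuzzy_or(fuzzy_and(a, fuzzy_not(b)), fuzzy_and(fuzzy_not(a), b))


def found_anywhere(image, template):
    """Fuzzy OR over all translations of the template index."""
    n = len(template)
    return fuzzy_or(*(match(image[i:i + n], template)
                      for i in range(len(image) - n + 1)))


EYE = [0.0, 1.0, 0.0]      # hypothetical q-template for an eye
MOUTH = [1.0, 1.0, 1.0]    # hypothetical q-template for a mouth


def face_degree(image):
    """Template T_y: an eye somewhere AND a mouth somewhere."""
    return fuzzy_and(found_anywhere(image, EYE), found_anywhere(image, MOUTH))


with_mouth = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
without_mouth = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(face_degree(with_mouth), face_degree(without_mouth))
```

Other choices for the fuzzy operations (for example minimum and maximum) fit the same interface; which one is appropriate depends on the assumed dependency structure of the subproperties.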

We now discuss two principal approaches to incorpo- 
rate preprocessor information, like the fuzzy templates, 
into a subsequent algorithm. 

5.4.1 Two coupling principles 

The loss minimizing algorithm makes the decision $f$. Such an optimizer must have an interface for relevant data, i.e. for the $q^l$ and corresponding answers $y$. This allows us to distinguish two principal variants to include the information of a preprocessor (see Fig. 7):

1. Feeding the output of the preprocessor into the given interface for relevant data, including the entrance for the $y$ which define the goal of learning. We will call this prior cascade and the preprocessor a data model generator (Fig. 7 top).

2. Feeding the preprocessor output to extra input channels of the subsequent optimizer. We will call this case input cascade. Here the preprocessor does not necessarily have to produce a data model. It can be implemented in two variants:

a. asynchronously with the data $D^l$ (Fig. 7 middle),

b. synchronously with the data $D^l$ (Fig. 7 bottom).

$q$-variables which enter the loss function only indirectly through $f(q)$ can then be skipped. This allows $q$ to be effectively replaced by preprocessor output.

We now discuss the two approaches in more detail. 

5.4.2 Prior cascade (data modeling 
preprocessor) 

In a prior cascade, a data modeling preprocessor intends to produce a (fuzzy) model $\hat{f}$ for the input of the optimizer, i.e. of the relevant data $D^l$. A Bayesian model is a special model formulated in terms of probabilities. Accepting its validity, the model can be used to create new virtual examples $D^{v,l}$ for relevant questions $q \in Q^l$. For the corresponding relevant data $D^l$ a loss function is defined, fixing their 'interpretation' for the optimizer. For virtual examples $D^{v,l}$ it is assumed that the same interpretation, i.e. loss function, applies, so they can use the $q$ and the $y$ entrance of the optimizer. Thus, the flexibility of this approach consists in the fact that the $D^{v,l}$ fit a given input format and a given interpretation, so they do not require adaptation of the optimizer. Such a preprocessor may be seen as minimizing (implicitly or explicitly) an approximation loss for the data. This does not necessarily coincide with the loss used by the optimizer, which may, for example, include additional complexity penalties. Indeed, if the optimizer had to minimize the same loss as the preprocessor, which would include having the same architectural restrictions, it could not produce new results, and necessarily $f = \hat{f}$, as long as no
Figure 7: Top: Prior cascade (data modeling preprocessor). The preprocessor uses the data interface of the optimizer, so the interpretation of its information as (virtual) data is implicit. Thus, 1. the preprocessor must represent a data model, 2. the optimizer does not have to be adapted. (approx. loss = approximation loss)
Middle: Input cascade (general preprocessor). The preprocessor does not use the whole interface for loss relevant data, i.e. at least not the entrance for $y$, which are the variables for the loss function defining the goal of learning. It can, however, for a loss which is not explicitly $q$-dependent (but always indirectly through $f$), replace part or all of $q$. Thus, 1. the preprocessor does not need to represent a data model, therefore its loss can also be adapted according to the actual needs of the optimizer, 2. however, the optimizer must in general be adapted if the preprocessor adds information.
Bottom: Special case of an input cascade: The input $q$ is changed, either by replacing $q$ by preprocessor output (only if the loss is $q$-independent, e.g. multilayer neural network) or by adding new dimensions (an extreme case is the support vector machine, with the preprocessing implicit in the kernel). Those additional dimensions can also be the output of a data modeling preprocessor and represent an approximation for $y$ or for the optimal reaction $f(q)$ (which is the same in a pure approximation problem). For a more detailed explanation see text.
new data are included. If new data are available, the preprocessor generates only part of the optimizer's input, weighted according to its prior probability. Approximation loss is related to any non-approximation loss which depends on the relevant data.

Consider now a fuzzy preprocessor, like the fuzzy face detector we described in principle. There, new training examples can be generated by sampling according to a term $C_q \propto -\int dq\, p(q)\, \|y_q(f^0) - T_{y_q}\|^2 + c$ in the log-prior (for probabilistic $f^0$ we integrate also over $p(y_q|q,f^0)$). The approximation $T_y$ for $y_q$ itself can be based on approximations implemented by virtual examples. Note that virtual $q$ can include questions which are not available as training questions, as long as a loss function is defined for them. For a face detector, those $q$ can correspond to images which appear neither as faces nor as non-faces, i.e. which are not among the 'natural' images $Q^l$. Such 'non-natural' $q$ can be images of faces with some subconcepts missing or exchanged, for example, one eye transformed to a non-eye, e.g. to a square. Such a square-eyed face is probably not in the set of 'natural' images $Q^l$ but may be useful to teach the face detector the concept of an eye and its relevance for face detection. Including those examples near the border between faces and (the concept of, not the natural) non-faces specifies the generalization intended by the trainer.

The $q$-integral in the log-prior can be treated in several ways:

1. the integration may be performed analytically, and the integral (infinite sum) replaced by an equivalent term which is easier to evaluate,

2. by virtual examples, i.e. pairs $(q, T_y)$, e.g. fuzzy classified (see e.g. Lin, Kung, & Lin, 1997). In the case of many virtual examples this estimation may be called a numerical evaluation of the integral,

3. by application adapted sampling, i.e. virtual examples are (also) used during the application phase of the algorithm. That means, in a specific application situation $q^l$ one includes newly sampled case-specific virtual examples before answering $q^l$. For example, one can evaluate the corresponding $T_{y^l}$ (if this can still influence the output and if available), and can also include $T_{y'}$ for other, e.g. 'similar', $q'$. We remark that this variant

a. assumes the availability of the approximation $T_y$ also during the application phase and not only during training,

b. requires on-line 'application-adapted relearning', because the newly generated, case-specific examples have to be incorporated in the learning process. A trivial example of application specific on-line 'learning' is interpolating between different available answers; a less trivial example is neural networks which are retrained with application specific 'hints' $T_{y^l}$ before giving an answer to a specific $q^l$. Also, a higher level approximator can include local prior terms according to the output of a parallel working fuzzy face detector for a given image, preclassifying faces and non-faces,

c. is equivalent to the use of an infinite number of virtual examples, because for every $q^l$ (i.e. arbitrarily many) new examples are drawn. That does not mean that a single prediction depends on an infinite number of virtual examples, but the definition of the whole learning machine, and therefore its expected performance, does. 36
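Variant 2, the evaluation of the prior term by virtual examples, can be sketched as follows. The sketch is a toy under stated assumptions: questions $q$ are scalars drawn uniformly from $[0,1]$, the hypothetical preprocessor `template` plays the role of $T_y$, the optimizer is a weighted least-squares line fit, and `prior_weight` stands for the weight factor of the prior. None of these choices is prescribed by the text.

```python
import random


def template(q):
    """Hypothetical preprocessor output T_y(q), e.g. a fuzzy detector."""
    return 2.0 * q


def sample_virtual(n, rng):
    """Draw virtual questions q from p(q) (assumed uniform) with answers T_y(q)."""
    return [(q, template(q)) for q in (rng.random() for _ in range(n))]


def fit_line(examples, weights):
    """Weighted least-squares fit of y = a*q + b; returns (a, b)."""
    sw = sum(weights)
    mq = sum(w * q for w, (q, _) in zip(weights, examples)) / sw
    my = sum(w * y for w, (_, y) in zip(weights, examples)) / sw
    var = sum(w * (q - mq) ** 2 for w, (q, _) in zip(weights, examples))
    cov = sum(w * (q - mq) * (y - my) for w, (q, y) in zip(weights, examples))
    a = cov / var
    return a, my - a * mq


def train(real, n_virtual, prior_weight, seed=0):
    """Feed real and virtual examples through the same (q, y) data entrance."""
    rng = random.Random(seed)
    virtual = sample_virtual(n_virtual, rng)
    weights = [1.0] * len(real) + [prior_weight] * len(virtual)
    return fit_line(real + virtual, weights)


REAL = [(0.0, 0.1), (1.0, 1.0)]          # hypothetical training data
slope_data_only, _ = train(REAL, 50, 0.0)
slope_with_prior, _ = train(REAL, 50, 100.0)
print(slope_data_only, slope_with_prior)
```

Because the virtual examples use the ordinary data entrance, the optimizer itself is unchanged; only the example list and the weights differ between the two calls.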

5.4.3 Input cascade (general preprocessing) 

In an input cascade, a general preprocessor produces output $T$ for which no loss function, i.e. prewired interpretation, has to exist. This kind of information $T$ cannot use the relevant data entrance of the optimizer. In cases, however, where the loss function is explicitly $q$-independent (not $y(q)$- or $f(q)$-independent!) the $q$-part of the data can be completely skipped and replaced. This does not include the $y$ variables which define the value of the loss function and therefore the goal for the optimizer. Thus, the output of a preprocessor can replace part of the $q$ or can be added as additional input to the optimizer, which has to be adapted to the new format of the input. This internal adaptation defines the interpretation of the additional data.

In principle the total (information of the) previous approximator $T$ could be added as a new input dimension. However, the increase in complexity by irrelevant data tends to lower the performance in an input cascade (see below). Also, a data modeling preprocessor can be used in an input cascade by adding available virtual data pairs to $q$, leaving $y$ unaltered.

In particular, the information of the preprocessor may be fed in synchronously with every data pair $(q^l, y)$. The transformed and often enlarged input space is also called feature space. Optimizing algorithms can often be adapted relatively easily to a higher input dimension of $q$. Thus, a simple form of an input cascade or preprocessing is including part of the information $T$ in every data pair $(q,y)$. This can be the result $f'$ of another optimizer, but also linearly independent components of the vector $q$, or distances to prototypes obtained by unsupervised algorithms. We mentioned that in cases where the loss function is explicitly $q$-independent the $q$-part of the data can be completely skipped and replaced by $T_q$, so that the original data entrance formally fits. This is the standard case of a hierarchical optimizer, like for example a neural network.



In practice one can for example add output $T_{y_q}$ corresponding to the same (and maybe a few related) $q$, i.e. $q^l \to \{q^l, T_y\}$. (Also aspects of the learning history might be added, but one usually assumes this to be well enough represented by the actual internal state of the
36 Again, infinity means nothing else than assuming to be always able to do something if needed, like in this case creating a template for every new input. That is how we defined the algorithm, and if there are cases where we cannot create corresponding templates, then we are just not able to apply this algorithm; nevertheless we say the infinite data statement is true for the algorithm by definition (definition = specifying possibilities of control).



optimizer). If $T_y$ for a given $q^l$ is not available, one may use a $q$ where a similar $T_y$ is expected, i.e. complete the template generator by adding another prior assumption. For example, assuming smoothness with respect to a distance $\|\cdot\|$, one can use the $q'$ given by $\arg\min_{q'} \|q' - q\|$ ('nearest' $q$) or interpolate between some of the neighbors. One may encode that a specific value is not available, or its expected correctness. For added as well as for replaced input, a function $f$ of the augmented and not of the original input is learned. That means the preprocessor also has to be available during application of $f$.
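A minimal sketch of such an input augmentation is given below. It assumes that questions are tuples of numbers, that a table `known` of already computed templates is available, and that unavailability is encoded by an extra flag component; the nearest stored question is found with the Euclidean distance. The names and the data layout are hypothetical.

```python
def squared_distance(q1, q2):
    """Squared Euclidean distance between two questions."""
    return sum((a - b) ** 2 for a, b in zip(q1, q2))


def nearest_template(q, known):
    """Return the template of the stored question closest to q."""
    _, t = min(known, key=lambda item: squared_distance(q, item[0]))
    return t


def augment(q, known):
    """Map q to (q, T_y, flag); flag is 1.0 if T_y was available for q itself."""
    for q_known, t in known:
        if tuple(q_known) == tuple(q):
            return tuple(q) + (t, 1.0)
    # Template not available: fall back on the smoothness assumption.
    return tuple(q) + (nearest_template(q, known), 0.0)


# Hypothetical table of questions with available preprocessor output T_y.
known = [((0.0, 0.0), 0.2), ((1.0, 1.0), 0.9)]
print(augment((1.0, 1.0), known))
print(augment((0.9, 0.8), known))
```

The learned function then takes the augmented tuple as input, so `augment` has to be callable during application as well as during training.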

For $T$ to be of any help, the preprocessor must deal with aspects related to the loss function, i.e. it must have (implicitly or explicitly) a loss function related to the loss function of the optimizer. For example, the preprocessor can explicitly use the same loss function, and therefore represent a previously trained optimizer. As the preprocessor does not have to produce a data model, its loss function can be adapted to the state of the optimizer. A typical example of an input cascade with feedback of the loss functions is a multilayer neural network. For all 'on-line' optimizers which do not use the complete available data set $D^l$ at the same time (e.g. backpropagation), the actual parameter values (e.g. weights in a neural network), representing the memory for past data, can be seen as output of a preprocessor. As any architectural restrictions can be related to prior data, any restricted optimizer may be seen as the result of some preprocessing.

In a prior cascade the next level algorithm treats the additional information as relevant data, i.e. the Bayesian interpretation of the template data is hardwired. In an input cascade the algorithm is free in its use of the data coming from the preprocessor. More precisely, the preprocessor data have a meaning implicitly implemented in the optimizer's algorithm and its architecture. Consider the extreme case where the answer the optimizer is looking for is already included as an additional input dimension. This is only of any help if the algorithm of the optimizer has, with a reasonable probability, in its space $F$ of possible hypotheses a function which is the projection of the whole input onto this dimension. This includes that, even if all other original $q$-dimensions are deleted, the identity (which might be parameterized in a very complicated way) must be part of the hypothesis space to allow the algorithm to find this 'simple' solution. Most practically used algorithms can probably find the identity or a projection easily, but one can also easily construct a model which cannot. Formulated more tautologically, additional information is helpful if it increases the probability of finding a better solution. (The meaning of 'better' can be specified in many variants.) The solution found by an algorithm must be part of a (learning history dependent) space $F_c$ of hypotheses which have actually been considered as possible solutions by the algorithm. A preprocessor can transform the input so that the transformed optimal solution $f^*$ is (averaged over possible learning histories, if the templates are added before seeing all $q^l$, and over random variables of the algorithm) with high probability in the set of considered hypotheses, $f^* \in F_c$. For example, for algorithms for which the projection is easily learnable, adding templates for the output as an additional input variable might be a good choice.
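The extreme case discussed above can be made concrete with a small Python sketch. It assumes a hypothetical data set in which the answer $y$ is not a linear function of $q$, and compares two hypothesis spaces: linear functions of the augmented input $(q, T)$ with $T = y$, which contain the projection onto the added dimension, and linear functions of $q$ alone, which do not. The least-squares solver and all names are illustrative only.

```python
def fit_two_inputs(X, y):
    """Least squares for y ~ w0*x0 + w1*x1 via the 2x2 normal equations."""
    a = sum(x0 * x0 for x0, _ in X)
    b = sum(x0 * x1 for x0, x1 in X)
    d = sum(x1 * x1 for _, x1 in X)
    r0 = sum(x0 * t for (x0, _), t in zip(X, y))
    r1 = sum(x1 * t for (_, x1), t in zip(X, y))
    det = a * d - b * b
    return (d * r0 - b * r1) / det, (a * r1 - b * r0) / det


def fit_one_input(q, y):
    """Least squares for y ~ w*q; this space cannot use the template."""
    return sum(qi * t for qi, t in zip(q, y)) / sum(qi * qi for qi in q)


def squared_error(predictions, y):
    """Sum of squared deviations between predictions and answers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, y))


Q = [0.1, 0.4, 0.7, 0.9]          # hypothetical original inputs
Y = [1.0, 0.0, 1.0, 0.0]          # answers, not linear in q
AUGMENTED = list(zip(Q, Y))       # template T = y added as a second input

w0, w1 = fit_two_inputs(AUGMENTED, Y)
w_restricted = fit_one_input(Q, Y)

error_augmented = squared_error([w0 * q + w1 * t for q, t in AUGMENTED], Y)
error_restricted = squared_error([w_restricted * q for q in Q], Y)
print((w0, w1), error_augmented, error_restricted)
```

In the augmented space the fit puts essentially all weight on the template dimension, whereas the restricted space leaves a substantial error; an optimizer whose hypothesis space excludes the projection gains nothing from the added input.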

If we assume the space of considered hypotheses $F_c$ to be bounded with respect to some resources, then an input transformation effectively interchanges (maybe probabilistically) some considered with non-considered hypotheses. Thus, changing the considered hypotheses by additional information means changing one factor within a multiple OR (of all $f_c$). Realizations of a multiple OR are usually very flat functions within the set $F_c$ (e.g. constant in the deterministic case) with a (more or less) sharp transition to zero at the boundary. Thus, an input cascade is expected to have a very flat prior, being only effective at the cutting edge. As a (highly) multiple OR is difficult to implement, information ($D$) which seems meaningful without being easily related to $p(f^0|D)$ can more easily be included in an input cascade (i.e. used for preprocessing) than in a prior cascade (i.e. data modeling). This is usually the case in non-approximation problems when information is related to the optimal answer and not to the possible state (see also Section 8).

Notice that in a prior cascade we can change the importance of priors by changing their weight factor. Because this is a parameter of a Bayesian model, there is usually no directly analogous parameter for the input cascade, so the importance of information is encoded implicitly in (the parameters describing) the algorithm. Assuming the Bayesian model to be correct, the prior cascade should give better results. In cases where the Bayesian model is not correct, the input cascade has less bias, as it is similar to a prior cascade with smaller weights for the incorrect priors. However, adding input variables makes the problem more complex by increasing the space of possible solutions. For example, the VC dimension may increase, or the (algorithm specific, implicit) prior for the true state of nature might be smaller in an enlarged space. Alternatively, the complexity of the approximator may be reduced, for example by requiring a stronger smoothness in higher dimensions. So, for example, the support vector machine enormously increases the input dimension (feature space), while on the other hand its VC dimension is expected to stay approximately constant. Compensating the addition of new variables by restriction of the search space $F$ is only successful if the restrictions reflect correct priors. 37 They should therefore be the result of a prior cascade, which suggests the possibility of including this information also without increasing the input dimension.

Footnote 37: In the support vector machine, the restrictions imposed by constructing an optimal hyperplane in the feature space can, by choosing appropriate kernels, lead to arbitrary restrictions in the original input space. Thus, the selection of the kernel includes the necessary prior knowledge. In the extreme case where the optimal solution (a one dimensional binary variable in binary classification) is presented as a template, a feature space with the dimension of that solution is sufficient to find the optimal hyperplane.

Thus, adding input variables is only expected to help if they contain enough information to compensate either for the higher dimensionality or for skipped variables in the problem. In contrast, for a prior cascade even the inclusion of variables with only small effects should improve the performance, as long as those effects are correctly modelled. An input cascade implemented by adding (or replacing) components to the relevant data is an application adapted sampling method which requires on-line availability of the preprocessor producing $T_y$, and therefore corresponds, for an infinite set $Q^l$, to a formally infinite amount of data.

These methods of integrating available approximators allow the use of information contained in other learning systems, and are therefore general methods of knowledge transfer.

6 Decision problems 

6.1 Definition 

In this section we study questions conditioned on data. Assume we have to choose one question (alternative) $\hat f$ out of a set of questions $\hat F$ (footnote 38). We will now use the symbol $l$ (actual loss) for the answers of $\hat f$ and assume that it contains all decision relevant information. Then a decision should depend only on the loss distributions of the various $\hat f$. We choose a minimum on the set $\hat F$ by defining a (risk) functional $r[p(\cdot|f,\hat f)]$ mapping probability densities of answers $p(l|\hat f, f(D))$ into a subset of the real numbers, bounded from below. A decision problem consists in finding the answer to a question $q^r$ with

$$\hat f^* = q^r(f(D)) = \operatorname{argmin}_{\hat f \in \hat F}\; r[p(\cdot|f,\hat f)].$$

If we want to emphasize the data dependency of the 
decision we speak of a learning problem. Approximation 
and classification problems are special cases of decision 
problems. 
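The definition above can be illustrated with a minimal sketch, assuming a finite set of action states whose loss distributions are already known. The action names, the loss values and the two risk functionals are hypothetical and chosen only to show that the selected action state depends on the functional $r$, not on the loss distributions alone.

```python
from typing import Callable, Dict, List, Tuple

# A loss distribution p(l | f, f_hat) for one action state, given as
# (loss value, probability) pairs. All numbers are made up for illustration.
LossDist = List[Tuple[float, float]]


def expected_loss(dist: LossDist) -> float:
    """Risk functional r: expectation of the loss."""
    return sum(loss * prob for loss, prob in dist)


def worst_case_loss(dist: LossDist) -> float:
    """Alternative risk functional: largest loss with nonzero probability."""
    return max(loss for loss, prob in dist if prob > 0.0)


def decide(loss_dists: Dict[str, LossDist],
           risk: Callable[[LossDist], float]) -> str:
    """Return the action state whose loss distribution minimizes the risk."""
    return min(loss_dists, key=lambda name: risk(loss_dists[name]))


loss_dists = {
    "cautious": [(1.0, 1.0)],            # always suffers loss 1
    "risky": [(0.0, 0.9), (5.0, 0.1)],   # usually no loss, sometimes loss 5
}

best_by_expectation = decide(loss_dists, expected_loss)
best_by_worst_case = decide(loss_dists, worst_case_loss)
```

Both functionals are bounded from below on these distributions, as the definition requires, yet they select different action states.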

We will now have a closer look at the questions $\hat f$. Using $p(y_q|q,f) = \int df^0\, p(f^0|f)\, p(y_q|q,f^0)$ we will from now on write the formulas for states of knowledge $f$ instead of for pure states $f^0$. According to Eq.(3) there exist some $q$ so that the probability of suffering loss $l$ in a given state of knowledge $f$ can be written

$$p(l|f,\hat f) = \int dq \int dy \int dz\; p(q|y_c,z_c,\hat f)\, p(y|q,f)\, p(z|q,y,\hat f)\, \delta\big(l(q,y,z,\hat f) - l\big).$$

This implies that noise variables $z$ of different questions are independent and that one specific realization of $y$ according to $p(y|q,f^0)$ can only appear multiple times within one question. The situation can be represented by influence diagrams with decision and value nodes (Pearl, 1988). We can define the decision relevant basis set $X = X^l$ by the set $Q^l$ of test data $q^l = q$. Then this formula is in the form usually given in Bayesian decision theory. Starting from a factorial state with respect to $X^l$, structural information relates generalized questions $q_D$ from the training data to the different $q = x$, and all information enabling generalization has to come from nonlocal $q_D$.

Footnote 38: We use the letter $\hat f$ instead of $q$ for this set of questions to symmetrize the further notation. This corresponds to an interpretation of the selected question as action state $\hat f$ in analogy to the model state $f$. The selected action state $\hat f = \hat f(f)$ can be seen as reaction (e.g. approximation) to the state of knowledge $f$.

The explicit $\hat f$-dependence comes from three factors, which are

i. the action $z$ (including noise) producing device $p(z|q,y,\hat f)$,

ii. the defining (loss) function $l(q,y,z,\hat f)$,

iii. the test set generator $p(q|y_c,z_c,\hat f)$.

One usually uses a formulation where the components in ii. and iii. are chosen to be explicitly $\hat f$-independent:

1. a) $p(q|y_c,z_c,\hat f) = p(q|y_c,z_c)$, or

   b) $p(q|y_c,z_c,\hat f) = p(q|y_c)$ (fair), or

   c) $p(q|y_c,z_c,\hat f) = p(q)$ (static),

2. $l(q,y,z,\hat f) = l(q,y,z)$.

Remark:

1. We can always fulfill conditions 1 a) and 2 by introducing additional $\hat f$-dependent variables $z$ (and a zero dimensional $x^l$) without effectively changing the model, as $\hat f$ always has to be parameterized. Therefore we defined $\hat f$ in Section 2 by $p(z|q,y,\hat f)$. To fulfill the stronger condition $p(q|y_c,z_c,\hat f) = p(q|y_c)$, all random processes also have to be attributed to the state by including $z_c$ into $y_c$, and for the strongest condition $p(q|y_c,z_c,\hat f) = p(q)$ by including $y_c$ and $z_c$ into $q$.

2. This covers all cases of function approximation and pattern classification.

We will call a decision problem fair if condition 1 b) is fulfilled, so that different action states $\hat f$ are compared in the same situations $q$ independent of their own previous outcomes, and static if condition 1 c) is fulfilled, so that $p(q)$ does not depend on any other outcomes.

If we are interested in the expectation of $l$, the remaining $z$ can for static problems be integrated out, defining an integrated loss function $\bar l$,

$$\bar l(q,y,\hat f) = \int dz\; p(z|q,y,\hat f)\, l(q,y,z,\hat f),$$

being a deterministic function of $\hat f$.

6.2 Parallel decision problems 

If we assume $p(z|q,y,\hat f)$ to be $y$-independent, this can be seen as a device producing answer (action) $z$ in situation $q$ when the model (of nature) in state $f$ produces $y$. Thus, $\hat f$ could be interpreted as (action) state of a second, independent model of actions capable of producing answers to the same questions as the original model (of nature). We will now write $\hat y$ for those $z$ which are produced independent of $y$. This independence can also be seen as definition of the $q$ as that part of the visible variables $(q,y)$ which can influence the answer $\hat y$. For example, the answer of the model $y$ can be assumed to be available for the device $\hat f$ only after the action $\hat y$ is produced. Now let us define a decision problem which compares two models in states $f$ and $\hat f$ answering the same questions $q$ under equal conditions. This can be seen as comparing the action state $\hat f$ with the state of nature $f$ in situations $q$ according to the criterion $l(q,y,\hat y)$. If we want to model a causal dependence of the action $\hat y$ on 'previous' $y$, we choose the general formulation as

PM: a parallel decision problem with memory, defined as a model with $p(z|q,y,\hat f) = p(z|q,y_c,\hat f)$, with the causal structure $y_c$ being the same as in $p(q|y_c,z_c)$. Then we write $\hat y$ for $z$, i.e. $p(\hat y|q,y_c,\hat f)$. (The notation allows dependencies within the components of $\hat y$.) Thus, the action producing device outputs $\hat y$ in action state $\hat f$ with its components $\hat y_i$ independent of the corresponding answer component $y_i$ of the model of nature $f$. However, it can use the values of previously determined components of $y$ and choose its actions according to its 'success' in the past.

We call a decision problem

P: a parallel decision problem without memory (or simply a parallel decision problem) if, writing $\hat y$ for $z$, the action model produces $\hat y$ in action state $\hat f$ independent of the answer $y$ of the model of nature $f$: $p(\hat y|q,y,\hat f) = p(\hat y|q,\hat f)$. (Dependencies within components of $\hat y$ are allowed.)

This leads to a loss probability, written for the more general case:

$$p(l|f,\hat f) = \int dq \int dy \int d\hat y\; p(q|y_c,\hat y_c)\, p(y|q,f)\, p(\hat y|q,y_c,\hat f)\, \delta\big(l(q,y,\hat y) - l\big).$$

Note the difference between not yet seen (expected) test data $D^l = (q,y,\hat y)$, which are not yet determined and correspond to the integration variables, and the known (training) data $D$, which determine our actual state of knowledge $f = f(D)$ and do not appear explicitly in the above formula. Parallel decision problems with memory show an asymmetry between $f$ and $\hat f$, as we allow a dependence of $\hat y$ on past values of $y$, while we defined $y$ to depend only on the question $q$ and the pure state $f^0$ and not on $\hat y$. All $\hat y$-dependency of real world measurements $p(y|q,f^0)$ by definition has to come from changing the question according to $p(q|y_c,\hat y_c)$ or from active changes of $f^0$ by $\hat y$, which we do not allow. We can parallel the treatment of $f$ and $\hat f$ by introducing a set of $y$-independent basis action states $\hat f^0$ and an 'algorithm' $p(\hat f^0|q,y_c,\hat y_c,\hat D)$. Here we included possible additional data $\hat D$ in the formalism, which also determine the available action states $\hat f$, and we can write

$$p(\hat y|q,y_c,\hat f,\hat D) = \int d\hat f^0\; p(\hat y|q,\hat f^0)\, p(\hat f^0|q,y_c,\hat y_c,\hat D).$$

For $p(\hat D|D) = p(\hat D)$ we can determine $\hat f$ before the training begins; otherwise, the set $\hat F$ of available choices also has to be updated with the training data $D$.

Approximation problems are parallel decision problems with approximation loss (see Section 5.3.4), parameterized in the form $l = -a \ln p(y|q,\hat f) + c$ with $\int dy\, p(y|q,\hat f) = 1$, $\forall q, \hat f$. With a temporal connotation they may also be called prediction problems. In such problems, we may wish to include the structure of the $q$ into the action ($\hat y$-producing) devices. For the set of $\hat f^0$, which in a parallel decision problem without memory already are the $\hat f$, we can then define an action model $p(\hat y_x|x,\hat f)$ for the basis questions $x$ and use for a question $q$

$$p(\hat y|q,\hat f) = \int dx \int d\hat y_x \int dz\; p(x|\hat y_x,z_c,q)\, p(\hat y_x|x,\hat f)\, p(z|x,\hat y_x,q)\, \delta\big(q(x,\hat y_x,z) - \hat y\big),$$

and for an action state $\hat f^0$

$$p(\hat y|q,\hat f^0) = \int dx \int d\hat y_x \int dz\; p(x|\hat y_x,z_c,q)\, p(\hat y_x|x,\hat f^0)\, p(z|x,\hat y_x,q)\, \delta\big(q(x,\hat y_x,z) - \hat y\big).$$

Note that even an $\hat f$-independent loss function can penalize the complexity or resource requirements of the $\hat f$, simply by including corresponding variables as components into $\hat y$.

While the $\hat f$ define the possible alternatives of loss distributions, the decision has to be made with respect to some risk functional $r$ applied to these distributions. Choosing, for example, in the case of a real loss function for parallel decision problems the expectation as functional $r$, we have to minimize the 'expectation risk' or expected risk

$$r(f,\hat f) = r[p(\cdot|f,\hat f)] = \int dl\; l\, p(l|f,\hat f) = \int dq \int dy \int d\hat y\; p(q|y_c,\hat y_c)\, p(y|q,f)\, p(\hat y|q,y_c,\hat f)\, l(q,y,\hat y).$$

Density estimations are a special form of parallel decision problems having only one $q$ with the meaning 'get next data'. Choosing an integrated loss which can for every $\hat f$ be interpreted as a log-probability, i.e. $\bar l(y,\hat f) = \ln p(y|\hat f) + c$, this gives

$$r(f,\hat f) = \int df^0 \int dy\; p(f^0|y_D)\, p(y|f^0)\, \ln p(y|\hat f).$$

One may well include other $\hat f$-specific aspects in the loss function, like $\hat f$-specific complexity costs, e.g. a term $\int dy\, p(y|\hat f) \ln p(y|\hat f)$ related to the encoding costs of $y$ given $\hat f$ (Rissanen, 1989). Sometimes the term 'unsupervised' learning is used for such problems, including problems which are defined by algorithms and do not explicitly refer to a set of parameterized models $p(y|\hat f)$, like e.g. self-organizing maps (see e.g. Kohonen, 1995). This term might be misleading, because all variables $y$ enter the (explicit or implicit) loss function and are therefore 'supervised' variables. The reason for nevertheless using the word 'unsupervised' is that density estimation is often applied to variables which are used in a second step as $q$ in a problem where the loss function does not depend on $q$ but on $y$. Indeed, the determination of $p(q)$ can in our formulation always be seen as a preprocessing step, because we explicitly require $p(q)$ to be independent of $f$.
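The expected risk of a parallel decision problem can be evaluated by direct summation when all variables are discrete. The following is a minimal sketch, assuming a static problem without memory, binary questions and answers, a zero-one loss, and three hypothetical deterministic action states; none of these numbers or names come from the text.

```python
from itertools import product

QUESTIONS = [0, 1]
ANSWERS = [0, 1]

p_q = {0: 0.5, 1: 0.5}            # static test set generator p(q), assumed
p_y1_given_q = {0: 0.2, 1: 0.7}   # model of nature p(y = 1 | q, f), assumed


def p_y(y, q):
    """p(y | q, f) for the assumed state of knowledge f."""
    p1 = p_y1_given_q[q]
    return p1 if y == 1 else 1.0 - p1


# Hypothetical action states f_hat, each given as p(y_hat | q, f_hat).
action_states = {
    "always_0": lambda y_hat, q: 1.0 if y_hat == 0 else 0.0,
    "copy_q": lambda y_hat, q: 1.0 if y_hat == q else 0.0,
    "always_1": lambda y_hat, q: 1.0 if y_hat == 1 else 0.0,
}


def zero_one_loss(q, y, y_hat):
    """l(q, y, y_hat): one unit of loss for every wrong answer."""
    return 0.0 if y == y_hat else 1.0


def expected_risk(p_y_hat, loss=zero_one_loss):
    """r(f, f_hat) = sum_q sum_y sum_y_hat p(q) p(y|q,f) p(y_hat|q,f_hat) l."""
    return sum(
        p_q[q] * p_y(y, q) * p_y_hat(y_hat, q) * loss(q, y, y_hat)
        for q, y, y_hat in product(QUESTIONS, ANSWERS, ANSWERS)
    )


risks = {name: expected_risk(dev) for name, dev in action_states.items()}
best_action_state = min(risks, key=risks.get)
```

Because the action device never sees $y$, the factor for $\hat y$ depends on $q$ only, which is exactly the independence that makes the problem parallel.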

To adapt to new data, parallel decision problems require an inversion to obtain

$$p(f^0|y_D,q_D) = \frac{p(y_D|q_D,f^0)\, p(f^0)}{p(y_D|q_D)}.$$

While this corresponds to an interchange of the roles of the variables $y_D$ and $f^0$, we will in the next subsection refer to another inversion: the interchange of $y$ with $q$.
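The inversion from data to pure states can be sketched for a finite set of pure states, where the integral in the normalization becomes a sum. The three candidate states, the uniform prior and the four training pairs below are hypothetical; the training answers are assumed independent given the state.

```python
# Each hypothetical pure state f0 is given by p(y = 1 | q, f0) for q in {0, 1}.
pure_states = {
    "flat": {0: 0.5, 1: 0.5},
    "increasing": {0: 0.2, 1: 0.8},
    "decreasing": {0: 0.8, 1: 0.2},
}
prior = {name: 1.0 / len(pure_states) for name in pure_states}

# Training data D as (q_D, y_D) pairs, made up for illustration.
training_data = [(0, 0), (1, 1), (1, 1), (0, 1)]


def likelihood(p1_given_q, data):
    """p(y_D | q_D, f0), assuming independent answers given the state."""
    result = 1.0
    for q, y in data:
        p1 = p1_given_q[q]
        result *= p1 if y == 1 else 1.0 - p1
    return result


def posterior(states, prior_probs, data):
    """p(f0 | y_D, q_D) = p(y_D | q_D, f0) p(f0) / p(y_D | q_D)."""
    joint = {name: likelihood(state, data) * prior_probs[name]
             for name, state in states.items()}
    evidence = sum(joint.values())  # p(y_D | q_D)
    return {name: value / evidence for name, value in joint.items()}


post = posterior(pure_states, prior, training_data)
f_map = max(post, key=post.get)  # maximum posterior pure state
```

Only the answers $y_D$ are inverted here; the questions $q_D$ stay on the conditioning side throughout, in contrast to the inversion discussed in the next subsection.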

6.3 Inverse decision problems 

Here we will discuss analysis by synthesis, where an action device $p(z|y,\hat f)$ is 'analyzed', i.e. its corresponding loss minimized, under a synthesis model $p(y|q,f)$ for $y$, i.e. for the question posed to $\hat f$ instead of for its answer $z$. For example, instead of solving an approximation problem, which is a parallel problem yielding a predictive model $\hat f$ of $f$, we may be interested in approximating the question ('cause') $q$ leading to $y$. Thus, we may want to approximate the inverse probability $p(q|y,f)$. Notice, however, that in our terminology a decision problem with 'inverse approximation loss', parameterized by $l = -a \ln p(q|y,\hat f) + c$ with $\int dq\, p(q|y,\hat f) = 1$, $\forall y, \hat f$, is not called an approximation problem (except for the trivial case $p(y|q,f) = p(q|y,f)$). An approximation problem would require a normalization over all $q$ and $\hat f$, i.e. in this case, where $q$ and $y$ are interchanged, of $\int dq\, p(q|y,\hat f)$. Indeed, only for (parallel) approximation problems can the inequality (5.3.4) be applied for given $q$, and therefore, as we will see in Section 7, a maximum posterior approximation can be sufficient.

Often, the variable describing the desired output of the action device is easier to choose as condition $q$ to build possible models $p(y|q,f^0)$ of nature than its input variable. Consider the following example: Let $y^a$ have the values 'face' and 'non-face' and $q^a$ be corresponding images of faces and non-faces. The task is to build a face detector, i.e. a device $p(\hat y^a|q^a,\hat f)$ which outputs an approximation $\hat y^a$ of $y^a$ for a given image $q^a$. That would mean we need a model $p(y^a|q^a,f)$, i.e. we must describe for a given image the possible probability distributions of being a face. This might be done using fuzzy priors, but might not be very reliable, as the probability for a given image to be a face is related to the kind of non-face objects included in $y^a$. Every change in the set of non-faces would require an update of the fuzzy prior. It is much more natural, and probably more closely related to the subjective process of generating priors by contemplating typical features of faces, to approximate the inverse probability of images $q^a$ given that it is a face $y^a$. This can be done (as well as for the inverse direction) by a fuzzy decomposition of the image $q^a$ into feature components $q^a_i$ (e.g. eye template) according to the methods we discussed earlier. Thus, we build a generative model for images given faces. Priors within the class of faces can now also be formulated, independent of the non-face objects, in terms of face generative parameters, e.g. specific individuals, illumination conditions, rotation angles, emotional expressions. Also, the distribution $p(y^a)$ of the low dimensional variable face/non-face is more easily adaptable and measurable than that of very high-dimensional images. For example, one can measure or control the average number of people passing a camera (i.e. $p(y^a)$) much more easily than one can change or even approximate the probability of the corresponding images $p(q^a)$.

However, according to our convention of denoting by $q$ the generalized questions to nature and not the input to the action producing device, we have to exchange the letters in the notation and write $q = y^a$ for the face variable and $y = q^a$ for the image. Then the face detector $p(\hat q|y,\hat f)$ gives an approximate classification $\hat q$ for the variable $q$.
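The inversion of a generative model into a detector can be sketched for a deliberately simple case. The sketch assumes that the image is summarized by a single real feature $y$ and that $p(y|q,f)$ is Gaussian with the same variance for $q$ = face and $q$ = non-face; the means, the variance and the prior are hypothetical. Under these assumptions Bayes' rule gives a detector that is a logistic function of $y$, and a change of $p(q)$ alters only its bias term.

```python
import math


def gaussian_pdf(y, mean, sigma):
    """Generative model p(y | q, f) for a one-dimensional image feature y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def detector_by_inversion(y, mean_face, mean_other, sigma, prior_face):
    """p(q = face | y, f) obtained from the generative model by Bayes' rule."""
    joint_face = gaussian_pdf(y, mean_face, sigma) * prior_face
    joint_other = gaussian_pdf(y, mean_other, sigma) * (1.0 - prior_face)
    return joint_face / (joint_face + joint_other)


def detector_logistic(y, mean_face, mean_other, sigma, prior_face):
    """The same posterior written as a logistic function of y.

    For equal variances the term quadratic in y cancels, leaving a weight
    that depends on the difference of the means and a bias that contains
    the log prior odds.
    """
    weight = (mean_face - mean_other) / sigma ** 2
    bias = ((mean_other ** 2 - mean_face ** 2) / (2.0 * sigma ** 2)
            + math.log(prior_face / (1.0 - prior_face)))
    return 1.0 / (1.0 + math.exp(-(weight * y + bias)))


# Hypothetical parameters, for illustration only.
params = dict(mean_face=1.0, mean_other=-0.5, sigma=0.8, prior_face=0.1)
features = [-2.0, -0.5, 0.0, 0.7, 1.5, 3.0]
max_difference = max(
    abs(detector_by_inversion(y, **params) - detector_logistic(y, **params))
    for y in features
)
```

The generative parameters (means and variance) and the question distribution (prior) enter the detector separately, so the prior can be re-measured or controlled without touching the model of images given faces.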

We now study more formally the situation where an action device inverts an available generative model, and define an

IM: inverse decision problem with memory as a model with $p(z|q,y,\hat f) = p(z|y,\hat q_c,\hat f)$, i.e. where the action model produces answers $\hat q$ depending not on the index of a generalized question $q$ but on its answer $y$. In this situation we will use for $z$ the notation $\hat q$. Dependencies within components of $\hat q$ are allowed.

In the same way we define an

I: inverse decision problem without memory as a model with $p(\hat q|q,y,\hat f) = p(\hat q|y,\hat f)$, with $z = \hat q$, i.e. where the action model produces answers $\hat q$ dependent on $y$, but independent of its history. Dependencies within components of $\hat q$ are allowed.

Because it also seems quite natural to work with variables representing input and output of the action producing device, we will call

AR: a problem to be formulated in action representation, if the action producing device is written in the form $p(z|q,y,\hat f) = p(\hat y^a|q^a,\hat f) = p(\hat y^a|q^a,y^a,\hat f)$. This defines the (action) output $\hat y^a = z$ and requires the $i$-component of the (action) input $q^a_i$ to include at least that part of $q_i, y_i$ on which $z_i$ depends. The remaining variables are called $y^a = \{q,y\} \setminus q^a$. Analogously, we can define an action representation with memory to be of the form $p(z|q,y,\hat f) = p(\hat y^a|q^a,y^a_c,\hat f) = p(\hat y^a|q^a,y^a,\hat f)$.

MR: We will call our original formulation, with $q$ representing questions and $y$ answers, a measurement representation.

Table 2 summarizes the definitions and our conventions used in the measurement representation (MR) and the action representation (AR).

Table 2: Classification of decision problems

  $p(z|q,y,\hat f)$    parallel                              inverse
  without memory       $p(z|q,\hat f)$                       $p(z|y,\hat f)$
    MR:                $p(\hat y|q,\hat f)$                  $p(\hat q|y,\hat f)$
    AR:                $p(\hat y^a|q^a,\hat f)$              $p(\hat y^a|q^a,\hat f)$
  with memory          $p(z|q,y_c,\hat f)$                   $p(z|y,\hat q_c,\hat f)$
    MR:                $p(\hat y|q,y_c,\hat f)$              $p(\hat q|y,\hat q_c,\hat f)$
    AR:                $p(\hat y^a|q^a,y^a_c,\hat f)$        $p(\hat y^a|q^a,\hat y^a_c,\hat f)$

Now we show that every inverse model looks 'pseudo parallel' in action representation without being necessarily equivalent to a parallel decision problem. The reason is that the roles of $q$ and $y$ are not freely exchangeable in a decision model. We defined a model to be completely determined by $q$, $y$ and $f$, where $q$ is the $f$-independent part of the variables, i.e. $p(q|f) = p(q)$, and $y$ comprises all $f$-dependent variables. That means the distribution of $q$ is always known, and can depend on a state of nature only through already observed data, i.e. $p(q|y_c,\hat q_c,f) = p(q|y_c,\hat q_c)$ independent of $f$. Then the decomposition $p(y|q,f) = \int df^0\, p(y|q,f^0)\, p(f^0|f)$ does not need to include a factor $p(q|f^0)$, and the joint probability $p(q,y|\hat q_c,f)$ factorizes into an $f$-independent and an $f$-dependent factor

$$p(q,y|\hat q_c,f) = p(q|y_c,\hat q_c)\, p(y|q,f). \qquad (16)$$



Accordingly, every problem in action representation can be written in at least a 'pseudo parallel' form. For the example of an inverse decision problem we can factorize the probability in the action representation

$$p(q,y|\hat q_c,f) = p(y|q_c,\hat q_c,f)\, p(q|y,\hat q_c,f) \qquad (17)$$
$$\qquad\qquad = p(q^a|y^a_c,\hat y^a_c,f)\, p(y^a|q^a,\hat y^a_c,f).$$

We recall that the first line has to be read as

$$p(y_1|f)\, p(q_1|y_1,f)\, p(y_2|y_1,q_1,\hat q_1,f)\, p(q_2|y_2,y_1,q_1,\hat q_1,f) \cdots,$$

where $p(q_2|y_2,y_1,q_1,\hat q_1,f)$ cannot necessarily be simplified to $p(q_2|y_2,f)$ if $p(q|y_c,\hat q_c) \neq p(q)$, i.e. for a nonstatic model. The factors of this inverse picture are related to the original model by (footnote 39)



$$p(q|y,\hat q_c,f) = \frac{p(q|y_c,\hat q_c)\, p(y|q,f)}{p(y|q_c,\hat q_c,f)}$$

and

$$p(y|q_c,\hat q_c,f) = \int_c dq\; p(q|y_c,\hat q_c)\, p(y|q,f),$$

where the symbol $\int_c$ denotes a 'causal' or 'conditional' integral defined as $\int_c dq\, p(q) = \prod_i \int dq_i\, p(q_i|q_{c(i)})$ (not $\int \prod_i dq_i\, p(q)$) with $q_{c(i)} = \{q_j \,|\, j < i\}$, so that for every factor the $q_{c(i)}$-dependency remains (footnote 40).

Footnote 39: See Saul, Jaakkola, & Jordan, 1996, and Jaakkola & Jordan, 1996, for a variational method for such inversions in sigmoid belief networks.

Footnote 40: Jordan, 1995, shows the interesting fact that Gaussians and the logistic function $\frac{1}{1+e^{-z}}$, which is often used as activation function in neural networks, are for binary classification related by such an inversion. For binary $y_i \in \{y_1, y_2\}$ and Gaussian $p(x|y_i)$ with equal variance (or covariance matrices), the posterior $p(y_i|x) = \frac{p(x|y_i)\, p(y_i)}{\sum_j p(x|y_j)\, p(y_j)}$

Notice that this is not the inversion necessary to obtain $p(f^0|D)$ from $p(y|q,f^0)$. For a static model, i.e. for $p(q|y_c,\hat q_c) = p(q)$, the joint probability factorizes also in the inverse notation, $p(q_i)\, p(y_i|q_i,f) = p(y_i|f)\, p(q_i|y_i,f)$, with the individual factors $p(y_i|f)$ still being state dependent. If we try to exchange the roles of $q$ and $y$, we find in Eq.(17) compared to Eq.(16):

1. History dependent questions. To interpret $y$ as a question in a measurement representation (and $f$ as a state), $f$ and $y$ must completely determine $p(q|y,\hat q_c,f) = p(q|y,f)$, as is the case in the inverse of a static model. In other cases one can define a set of history (i.e. $q_c, \hat q_c, y_c$) dependent questions $y'_i$ (or states) labeled by $\{y_i, y_{c(i)}, q_{c(i)}, \hat q_{c(i)}\}$.

2. State dependent questions. According to our definition of questions, $y$ can only be interpreted as a question (for nature, not for $\hat f$!) if it is state independent, i.e. $p(y|q_c,\hat q_c,f) = p(y|q_c,\hat q_c)$. In other cases the distribution of relevant 'questions' $y$ in the inverse picture is state dependent, i.e. not exactly known, and they are therefore effectively part of the unknown variables, which we call answers. It is always possible to enlarge $F^0$ to include $q$ and write $p(q|F_q) = p(q|D_q) = \int df_q\, p(f_q|D_q)\, p(q|f_q)$, depending on data $D_q$ which have been used to determine the probability distribution. Thus, the space $F^0 = F^0_q \otimes F^0_{y|q}$ also contains hypotheses about different possible parameterizations $p(q|f^0)$. However, as long as there is a 'factorial border' between $f^0_q$ and $f^0_{y|q}$, i.e. $p(f^0) = p(f^0_q)\, p(f^0_{y|q})$ factorizes, the problems for $q$ and for $y$ given $q$ (not $y$ alone) are independent. As soon as they are dependent, we must perform the decomposition into pure states $f^0_q$ ($f$-dependent), i.e. we must treat $q$ as an answer. For example, when not only $p(y|q,f)$ but also $p(q)$ is sampled when obtaining data, this is relevant for determining $p(q)$ in any case where it is not yet completely known. In this case conceptually both $q$ and $y$ are part of the answers, let us say $(q,y) = y'$, and the set of $q'$ is reduced to one element with the meaning 'get next data pair'. However, when the hypothesis spaces factorize and the data do not induce new correlations, both problems can be treated separately, and $q$ can be used as question for the $y$ problem.

We finally write down the probability of suffering loss $l$ in the measurement representation

$$p(l|f,\hat f) = \int dq \int d\hat q \int dy\; p(q|y_c,\hat q_c)\, p(y|q,f)\, p(\hat q|y,\hat q_c,\hat f)\, \delta\big(l(q,\hat q,y) - l\big)$$

with expected risk

$$r(f,\hat f) = \int dq \int d\hat q \int dy\; p(q|y_c,\hat q_c)\, p(y|q,f)\, p(\hat q|y,\hat q_c,\hat f)\, l(q,\hat q,y),$$

and in action representation

$$p(l|f,\hat f) = \int dq^a \int dy^a \int d\hat y^a\; p(q^a|y^a_c,\hat y^a_c,f)\, p(y^a|q^a,\hat y^a_c,f)\, p(\hat y^a|q^a,\hat y^a_c,\hat f)\, \delta\big(l(q^a,y^a,\hat y^a) - l\big)$$

with expected risk

$$r(f,\hat f) = \int dy^a \int d\hat y^a \int dq^a\; p(q^a|y^a_c,\hat y^a_c,f)\, p(y^a|q^a,\hat y^a_c,f)\, p(\hat y^a|q^a,\hat y^a_c,\hat f)\, l(q^a,y^a,\hat y^a).$$

Footnote 40 (continued): $p(y_i|x) = \frac{1}{1 + \frac{p(x|y_{j\neq i})\, p(y_{j\neq i})}{p(x|y_i)\, p(y_i)}}$ gives a logistic function because for equal variance the term quadratic in $x$ cancels and only the difference of the first moments remains. This also holds if the $p(x|y_i)$ belong to the same exponential family with possibly different first, but equal higher moments.

We conclude: An inverse decision problem cannot necessarily be formulated as a parallel decision problem, because its action representation yields in general state dependent $q^a$.

We will see later that for certain static decision problems a one step maximum posterior approximation is sufficient only for approximation problems, i.e. when $p(y|q,f)$ itself is to be approximated. Then this step has numerically the form of an empirical risk minimization. Obviously, approximating the inverse probability $p(q|y,f)$ is not an approximation problem for $p(y|q,f)$ (except when $p(y|q,f) = p(q|y,f)$), and we have just seen that it cannot be transformed into one. Thus, for inverse problems a maximum posterior approximation should in general be completed by a second step. This two step procedure has for inverse problems the following form: In a first step $p(y|q,f)$ is approximated by $p(y|q,\hat f_{app})$. In the second step the optimal approximating action device $\hat f_{app}$ is identified with a (fixed) state of knowledge $f^* = \hat f_{app}$, and accordingly $\hat y$ with $y$. Then $p(\hat q|y,\hat f)$ is chosen to minimize the loss for the approximating device, i.e. for the given state $f^*$. However, for a full Bayesian treatment, and for the justification of an empirical risk minimization, the difference between parallel and inverse problems is in no way conceptually important.
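The two step procedure for inverse problems can be sketched on a toy version of the face example. Everything concrete below is an assumption made for illustration: the image is reduced to a binary feature $y$ (bright or dark), there are two candidate generative models, $p(q)$ is static and known, and missing a face is taken to be five times as costly as a false alarm.

```python
import math

Q_VALUES = ["face", "non-face"]
Y_VALUES = ["bright", "dark"]

p_q = {"face": 0.1, "non-face": 0.9}   # static, known p(q), assumed

# Hypothetical candidate states, each given by p(y = bright | q, f0).
candidate_states = {
    "A": {"face": 0.9, "non-face": 0.3},
    "B": {"face": 0.6, "non-face": 0.5},
}
prior = {"A": 0.5, "B": 0.5}

# Training data as (q, y) pairs, made up for illustration.
training = [("face", "bright"), ("non-face", "dark"), ("non-face", "dark"),
            ("face", "bright"), ("non-face", "bright")]


def p_y_given_q(y, q, state):
    p_bright = candidate_states[state][q]
    return p_bright if y == "bright" else 1.0 - p_bright


def map_state(data):
    """Step 1: maximum posterior approximation of p(y | q, f)."""
    def log_posterior(state):
        return math.log(prior[state]) + sum(
            math.log(p_y_given_q(y, q, state)) for q, y in data)
    return max(candidate_states, key=log_posterior)


def loss(q, q_hat):
    """Asymmetric loss l(q, q_hat): a missed face costs 5, a false alarm 1."""
    if q == q_hat:
        return 0.0
    return 5.0 if q == "face" else 1.0


def best_answer(y, state):
    """Step 2: risk minimization for the fixed state f* found in step 1."""
    joint = {q: p_q[q] * p_y_given_q(y, q, state) for q in Q_VALUES}
    return min(Q_VALUES,
               key=lambda q_hat: sum(joint[q] * loss(q, q_hat)
                                     for q in Q_VALUES))


f_star = map_state(training)
detector = {y: best_answer(y, f_star) for y in Y_VALUES}
```

In this sketch the second step is not redundant: for a bright image the more probable cause under $f^*$ is 'non-face', yet the asymmetric loss makes 'face' the risk minimizing answer.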

6.4 Algorithms 

Often the process of calculating the answer to $q^r$, that is the optimal learning or decision algorithm, is too difficult to be actually performed. Thus, one has to use a simplified decision or learning algorithm, i.e. another question $\tilde q^r = a$, to produce an answer $\hat f = a(f(D))$. We discuss the following for parallel decision problems, but it applies analogously to inverse and general fair decision problems. If we have to decide between several available algorithms, this corresponds to another (higher level) decision problem (footnote 41), with the $\hat y$-producing device $p(\hat y|q,\hat f)$ replaced by an $\hat f$-producing device $a$ with $p(\hat f|q,y_c,\hat y_c,f,a)$, and we have to define the $a$-dependency of the loss, which could be formulated by extending the set of relevant questions $Q^l \to Q^{l,a}$ to include loss relevant aspects. If no algorithm specific aspects have to be included, only an additional $\hat f$-integration has to be performed. The same is essentially true if the algorithm specific loss can be represented by a $q, y, \hat f$-independent constant. On the other hand, loss related aspects which depend on more than one of the original relevant questions $q^l$ also depend on the correlations between the $q^l$ (to which the original loss, and thus also the risk, is insensitive) and can generate a much more complicated set of relevant questions and associated loss $l$ for the new problem (footnote 42). The expression $p(\hat f|q,y_c,\hat y_c,f,a)$ corresponds to the expression $p(\hat f^0|q,y_c,\hat y_c,\hat D)$ we discussed for parallel decision problems. Thus, from this point of view the $\hat f$ correspond to possible pure action states $\hat f^0$, and the algorithm $a$ to the available data $\hat D$. Interestingly, one notation refers only to data and assumes the algorithm to be implicit, while the other refers to the algorithm and assumes the data to be implicit. Thus, we have a parallel decision problem with memory where the $\hat f$ are part of the internal noise variables $\hat y^a = (\hat y, \hat f)$. The $q^l$ are related to a loss function $l$ which can depend on $a$, and $q^r(a)$ chooses the $a$ which minimizes a functional $r$ of the loss distribution

$$p(l|a,f) = \int d\hat f \int dq \int dy \int d\hat y\; p(q|y_c,\hat y_c,\hat f_c)\, p(y|q,f)\, p(\hat y|q,\hat f)\, p(\hat f|q,y_c,\hat y_c,a,f)\, \delta\big(l(q,y,\hat y,\hat f,a) - l\big).$$

Footnote 41: This problem of finding an optimal algorithm can be much more complicated than finding the optimal decision. So a comparison of algorithms can be done only in a small number of (simple enough) cases (see for example Watkin, Rau, & Biehl, 1993, and references therein for a review). When approximations are also used for the high level problem, all the same problems appear again on the higher level. But if we assume the decision problems of the different levels to be similar, we can at least check for consistency: Does approximation A on level $i$ produce approximation A on level $i - 1$?

A loss $l(q,y,\hat y,\hat f)$ can be chosen $\hat f$-independent by including additional variables $\hat y = \hat y(\hat f)$ if necessary. To include algorithm specific aspects without using an explicitly algorithm dependent loss function we can, for example, include $a$-dependent internal variables $y_a$ in the loss function, produced with $p(y_a|q,a)$ and measuring algorithm specific quantities such as the requirement of resources like calculation time and memory. An algorithm is defined by its $p(\hat f|q,y_c,\hat y_c,f,a)$. It produces an answer $\hat f$ which should at least depend on its state of knowledge $f$, that is on its prior probabilities and training data. We always assume its knowledge of the $\hat f$, i.e. of the $p(\hat y|q,\hat f)$. The dependence of $p(\hat f|q,y_c,\hat y_c,f,a)$ on $y_c$ and $\hat y_c$ allows the algorithm to adapt, i.e. learn, during the test phase. In that case the loss evaluates the learning curve of algorithms. One can allow the choice of $\hat f$ to depend on the test question $q$, which is equivalent to enlarging the space of available $\hat f$. The variable $\hat f$ can be a vector, for example if a decision is required after presenting part of $q$. In situations with $p(q|y_c,\hat y_c,\hat f_c) = p(q|y_c,\hat y_c)$ and $l(q,y,\hat y,\hat f) = l(q,y,\hat y)$, algorithms can be seen as improved $f$-dependent $\hat y$-producing devices with

$$p(\hat y|q,y_c,\hat y_c,f,a) = \int d\hat f\; p(\hat y|q,\hat f)\, p(\hat f|q,y_c,\hat y_c,f,a).$$

Choosing for real loss the expectation as functional $r$, we are looking for the $a$ which minimizes

$$r(f,a) = r[p(\cdot|f,a)] = \int dl\; l\, p(l|f,a) = \int d\hat f \int dq \int dy \int d\hat y\; p(q|y_c,\hat y_c,\hat f_c)\, p(y|q,f)\, p(\hat y|q,\hat f)\, p(\hat f|q,y_c,\hat y_c,f,a)\, l(q,y,\hat y,\hat f,a).$$

Footnote 42: Compare the distinction between $l$ and $L$ in (Haussler, 1995).
6.5 Minimization 

Solving decision problems requires minimization algorithms. In general, the minimization problem can itself be seen as a learning problem: Given data of function values and information about the function (e.g. differentiable, symmetric), give the position of the minimum. To that extent our discussion of learning and prior information also applies to optimization. Usually, minimization algorithms perform active queries for local data (for example, taking a new data point in the direction of the gradient) to improve a given guess for the location of the minimum, and proceed iteratively until a certain convergence criterion is fulfilled. An iteration procedure can be written

$$\hat f^{\,i+1} = G(i, D_i, \hat f^{\,i}) = G_i(\hat f^{\,i}),$$

where $\hat f^{\,i}$ denotes the current guess for the location of the minimum at step $i$ and $D_i$ (which for the sake of simplicity will be dropped from the notation from now on) the new and accumulated previous data ($D_i = \{(r(f,\hat f^{\,j}), \hat f^{\,j})\}_{j<i}$). Besides data points, $G$ depends on prior knowledge about the function (in our case $r(f,\hat f)$). The past iterations provide data for the minimization problem, and $D_i$ indicates that $G$ changes with the amount of available data, i.e. the number of iterations. Thus, one can say $G$ is trained on the available data. This at least implicitly assumes prior knowledge which allows available data to carry information about other parts of the function.
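A minimal sketch of such an iteration is gradient descent with a numerically estimated gradient, where each step performs two active local queries. The quadratic risk, the step size and the stopping rule below are assumptions made for illustration. The prior knowledge that the function is smooth is what justifies using local function values to predict where smaller values lie.

```python
def risk(f_hat):
    """Hypothetical one-dimensional risk with its minimum at f_hat = 2."""
    return (f_hat - 2.0) ** 2 + 1.0


def local_gradient(func, x, h=1e-5):
    """Active query for local data: two function values around x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def minimize(func, f_start, step=0.25, tol=1e-8, max_iter=1000):
    """Iterate f_next = G_i(f_current) until the update is smaller than tol.

    The list `data` plays the role of D_i: the (function value, location)
    pairs accumulated over the past iterations.
    """
    data = []
    f_hat = f_start
    for _ in range(max_iter):
        data.append((func(f_hat), f_hat))
        f_next = f_hat - step * local_gradient(func, f_hat)
        if abs(f_next - f_hat) < tol:
            return f_next, data
        f_hat = f_next
    return f_hat, data


f_min, history = minimize(risk, f_start=-3.0)
```

This particular $G$ uses only the most recent data point; methods that adapt the step size or build a curvature model from `history` are examples of a $G$ that is trained on the accumulated data.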

In general, we can allow any reparameterization $T$ (of the $\hat f$, in our case) as long as $T$ is locally injective at the locations of global minima $\hat f^*$, i.e. $T^{-1}(T(\hat f^*)) = \hat f^*$. Reparameterizations need not be injective at points which are not minima, as those can be excluded from further search. But even under globally bijective reparameterizations, minimization problems can look quite different for the transformed variables. Reparameterizations can be nonlinear, differentiable or non-differentiable, or linear transformations for vectors or functions (like $\hat f$), i.e. a change of the representing basis if $\hat F$ is a vector or Hilbert space, or even a random permutation of the function values. Transformations can create arbitrary neighborhood relations, so the minimization problem can become trivial, for example when the function values are ordered monotonically (to find such a permutation, however, the minimum problem has to be solved), or arbitrarily hard (non-smooth, random).

Sometimes, it is technically helpful to 'linearize' problems by giving every degree of freedom its own linear dimension. If we define $R$ to be the space of all possible functions $r(f,\hat f)$ (for fixed $f$) and give the values of $r$ a linear structure, we can expand any $r$, at least formally, into basis functions, $r = \sum_i a_i b_i(\hat f)$. Thus any function $r'(\hat f) = \sum_i a'_i b_i(\hat f)$ on $\hat F$ taking values in the linear range of $r(f,\hat f)$, for example $r'(\hat f) = r(T(\hat f))$, is by construction of $R$ a linear transformation $A(\hat f,\hat f')$ of $r(\hat f)$. Then $r' = A r$, which is defined by the mapping of the coefficient vectors with $a' = A a$ (footnote 43). The resulting dimensionality may however very soon become intractably large, e.g. for infinite $\hat F$ the resulting space has infinite dimension. To be able to use such a space for calculation, it must be restricted. In Hilbert spaces, for example, only functions are allowed for which the expansion in a basis can at least be arbitrarily well approximated in some norm by an (arbitrarily large, but) finite number of basis functions.

Reparameterizations always change only the arguments, not the function (risk) values themselves. There is also the possibility of changing the function values without changing the location of the minimum. In general, for a function $f$ (in our case the risk, i.e. $f$ does not denote a state here) the positions $x^*$ ($= \tilde f^*$) of global minima $f(x^*)$ ($= r(f, \tilde f^*)$), defined by $y^* = f(x^*) \le y = f(x)$, $\forall x$, do not change under transformations $h$ which obey $y^* < y \Leftrightarrow h(y^*) < h(y)$, $\forall y$. We will call such transformations $h$ strictly monotonically increasing relative to $y^*$. (Analogously, we define strictly monotonically decreasing relative to $y^*$ by $y^* < y \Leftrightarrow h(y^*) > h(y)$, $\forall y$.) As we do not know $y^* = f(x^*)$ in advance, we have to require this relative to all values which are possible candidates for a global minimum.

Minimization methods can only return a minimum within some selected finite (sub)set of considered function values. Local methods (see below), for example, find local minima. Locations of minima are invariant for all subsets under strictly monotonically increasing transformations $h$ defined by $y < y' \Rightarrow h(y) < h(y')$, $\forall y, y'$, equivalent to strictly monotonically increasing relative to all $y$, i.e. $y < y' \Leftrightarrow h(y) < h(y')$, $\forall y, y'$. Analogously defined strictly monotonically decreasing transformations change a minimum to a maximum.
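The invariance of minimum locations under strictly monotonic transformations can be checked numerically. The following is a minimal sketch, not part of the original argument: the risk values, the grid, and the helper names (`argmin_index`, `h_inc`, `h_dec`) are hypothetical choices made only for illustration.

```python
import math

def argmin_index(values):
    """Index of the smallest entry (first occurrence on ties)."""
    return min(range(len(values)), key=lambda i: values[i])

def argmax_index(values):
    """Index of the largest entry (first occurrence on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical risk values on a grid of candidate arguments x.
grid = [i / 10.0 for i in range(-30, 31)]
risk = [(x - 0.7) ** 2 + 0.3 * math.sin(5.0 * x) for x in grid]

def h_inc(y):
    # exp is strictly increasing on the reals.
    return math.exp(y)

def h_dec(y):
    # -exp is strictly decreasing on the reals.
    return -math.exp(y)

i_star = argmin_index(risk)

# A strictly increasing h leaves the location of the minimum unchanged.
same_minimum = argmin_index([h_inc(y) for y in risk]) == i_star

# A strictly decreasing h turns the minimum into a maximum.
min_becomes_max = argmax_index([h_dec(y) for y in risk]) == i_star

# The invariance also holds on any subset of considered values.
subset = risk[10:40]
same_on_subset = argmin_index([h_inc(y) for y in subset]) == argmin_index(subset)
```

Only the ordering of the function values enters, which is why any strictly increasing `h` could be substituted for the exponential used here.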

There is a large variety of methods and concepts avail- 
able for minimization, with many possibilities of combi- 
nations. (For a discussion with respect to neural net- 
works see for example Golden, 1996. In Section 9 opti- 
mization methods are needed for maximizing the poste- 
rior probability.) The fact that the following principles of 
minimization algorithms can be applied to transformed 
(strictly monotonically increasing relative to global min- 
ima) and reparameterized (bijective at locations of global 
minima) problems, makes clear, on one hand, that in 
general a large number of possibilities can exist to attack 
a specific problem. On the other hand it shows that also 
the optimization process depends on prior information. 

43 See for example generalized additive models, where interaction terms are added (Hastie & Tibshirani, 1990), the comments in Minsky & Papert, 1990 (the new edition of the 1969 book) about the general applicability of (linear) perceptrons, the support vector machine, where linear relations in the feature space correspond to nonlinear relations in the input space (Vapnik, 1995), and for a general approach, Smola & Schölkopf, 1997.

The simplest method, which does not refer to any dependencies of function values, is

1. an unadapted search (stochastic, predetermined deterministic, or exhaustive), i.e. with data $D_i$ collected independent of function (risk) values for $\tilde f$. One iteration step consists in sampling a new data point, and $G$ compares this point with the current guess $\tilde f^i$. The sampling distribution does not depend on previous iteration steps.

Deterministic algorithms (deterministic $G$) are special cases of stochastic algorithms (probabilistic $G$). Prior information enters $G$ if the sampling distribution at step $i$ depends on $D_i$. The most common case is that of local dependencies, yielding

2. local iterative methods. For smooth functions small values are in the neighborhood of other small values, and therefore the search (sampling probability) is concentrated near the current guess of the minimum. As differentiable functions have a gradient equal to zero at a minimum, this allows one to search for those zeros, which are possible locations of minima. Sometimes stationary points can be found analytically, but usually nonlinear equations require iterative solutions. (The term 'analytical' is commonly used for solutions where the iteration can be done easily, like determining a certain square root or the numerical evaluation of constants like $e$ or $\pi$.) Technically, the iteration is often implemented in the form of a relaxation method (see Section 9), which includes common algorithms like those based on the gradient and its stochastic (like on-line learning) and restricted (e.g. line search) variants, and may include higher order derivatives (as in Newton methods) (Pierre, 1986, Bazaraa, Sherali, & Shetty, 1993, Bertsekas, 1995). For these algorithms $G$ depends on the value of the function and some of its derivatives at the current location. In a discretized implementation of derivatives (and similarly in a simplex search method) $G$ depends on a set $D_i$ of function values $r(\tilde f)$ which can be called a local population. Those methods find local minima. A nonlocal aspect can be added simply by combining them with unadapted search methods, usually implemented by comparing different local minima.
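The combination of a local iterative method with an unadapted search, as described in the last sentences of item 2, can be sketched as follows. This is an illustrative example under stated assumptions: the one-dimensional risk, the step width, the number of restarts, and all function names are hypothetical and not taken from the paper.

```python
import random

def risk(x):
    # Hypothetical risk with two local minima (near x = -1.04 and x = 0.96).
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad(x, eps=1e-6):
    # Discretized derivative: G only uses a local population of risk values.
    return (risk(x + eps) - risk(x - eps)) / (2.0 * eps)

def local_descent(x, step=0.05, iters=500):
    """Plain gradient descent; it converges to a local minimum only."""
    for _ in range(iters):
        x -= step * grad(x)
    return x

def multistart(n_starts=20, seed=0):
    """Unadapted (random) choice of starting points, followed by a
    comparison of the local minima that were found."""
    rng = random.Random(seed)
    starts = [rng.uniform(-2.0, 2.0) for _ in range(n_starts)]
    minima = [local_descent(x0) for x0 in starts]
    return min(minima, key=risk)

x_best = multistart()
```

A single descent started at `x = 1.5` ends in the higher local minimum; only the comparison across restarts adds the nonlocal aspect.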

There can be nonlocal dependencies between function 
values, which makes it desirable to have G being depen- 
dent not only on a local but also on a nonlocal population 
of function values, leading to 

3. parallel or nonlocal iterative methods. Here $G$ depends on a nonlocal population $D_i$, like in genetic algorithms (see Holland, 1975, Goldberg, 1989, Davis, 1991, Michalewicz, 1992, Schwefel, 1995, Mitchell, 1996). This allows nonlocal interactions between a possibly large number of function values. The dynamics of the population $D_i \to D_{i+1}$ corresponds to an iteration for a population vector.

In general, we only have to require that a fixed point 
of the iteration corresponds to a solution of the mini- 
mization problem. This allows transformations of the 
problem during iteration: 
4. Transformation methods use transformations of the problem (i.e. of $G_i$), starting with an easily solvable, e.g. one-minimum, problem and slowly transforming it to the problem of interest. They are called homotopy or continuation methods if they approximate a smooth family of transformations (Allgower, 1990, Richter & DeCarlo, 1983, or, for an application in scattering theory, Giraud & Nagarajan, 1991, Wierling et al., 1994). Parameters like the step width in gradient algorithms, the mutation rate in genetic algorithms, or the temperature in simulated annealing (see Ripley, 1987, Davis, 1987, Aarts & Korst, 1989) can also be seen as such deformation parameters.

An example of transformations which correspond to a strictly monotonic transformation at fixed points are transformations with $h(y) < h(y^*) \Rightarrow y < y^*$, $\forall y$, which we will call minimality sufficient relative to $y^*$, if $y^*$ is updated (at least from time to time) during iteration. (Accordingly, $h(y) > h(y^*) \Rightarrow y > y^*$, $\forall y$, will be called maximality sufficient relative to $y^*$.) For example, adding a function $r'(\tilde f)$ to the risk $r(f, \tilde f)$ (e.g. a (quasi) distance $D(\tilde f^i, \tilde f^{i+1})$) with a minimum at the current guess ($\tilde f^i$) ensures that decreasing the sum of both terms ($r + r'$) also decreases the original function ($r$). This is, for example, used in EM-like algorithms (Dempster, Laird, Rubin, 1979, Tanner, 1993, Gelman, Carlin, Stern, Rubin, 1995; for an information geometrical interpretation and the related (most times identical) em algorithm see Amari, 1985, 1995).
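The effect of adding a distance term with its minimum at the current guess can be illustrated with a small proximal-type iteration. This is only a sketch: the risk, the quadratic distance, its weight, and the grid of candidates are hypothetical, and the scheme is a generic stand-in rather than the EM algorithm itself.

```python
def risk(x):
    # Hypothetical risk (the same shape as a double well with a tilt).
    return (x * x - 1.0) ** 2 + 0.3 * x

# Finite set of candidate arguments considered by the search.
grid = [i / 100.0 for i in range(-200, 201)]

def proximal_step(x_cur, weight=5.0):
    """Minimize risk plus a distance term that is zero at the current guess.
    Since the surrogate equals the risk at x_cur, any decrease of the
    surrogate is also a decrease of the original risk."""
    return min(grid, key=lambda x: risk(x) + weight * (x - x_cur) ** 2)

def proximal_iteration(x0, steps=50):
    xs = [x0]
    for _ in range(steps):
        xs.append(proximal_step(xs[-1]))
    return xs

xs = proximal_iteration(1.5)          # the start value lies on the grid
risks = [risk(x) for x in xs]
monotone = all(b <= a for a, b in zip(risks, risks[1:]))
```

The sequence of risk values never increases because the current guess is always among the candidates, so the surrogate minimum cannot lie above the current risk.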

Non-exhaustive search is always restricted to a sub- 
space. Restricting to a subspace a priori is also called 

5. a variational method. Here the function is parame- 
terized and only a part of the parameters are used 
for minimization. In a linear variational method 
the stationarity condition is, for example, expanded 
into a linear basis of a Hilbert space of functions, 
and solved in a linear subspace. Examples in- 
clude the methods of finite elements. Variational 
methods, including nonlinear ones, are also often 
used in physics, especially in quantum mechan- 
ics. There, for example, finding a bound for the 
ground state energy of quantum mechanical sys- 
tems smaller than some instability causing thresh- 
old can have drastic consequences. In this context 
a product ansatz for functions in several variables is 
also called a mean field approach. Variational methods have recently also been applied to general graphical models (see for example Saul, Jaakkola, & Jordan, 1996, Jaakkola & Jordan, 1996).

We conclude with the interesting observation that (non-exhaustive) minimization requires knowledge about nonlocal dependencies also for the risk functional $r(f, \tilde f)$. This suggests the possibility, in principle, of reducing the risk functional for practically solvable problems to independent values, whose number in practice must be finite, corresponding to a finite $\tilde F$ and a finite effective $F^0$.



7 Bayesian approach 
7.1 Inserting model states 

The following constituents of a decision problem are assumed to be known: 1.) the action producing device $p(\tilde y|q, \tilde f)$, 2.) the definition of the test distribution $p(q|y_c, \tilde y_c)$, 44 3.) the loss function $l(q, y, \tilde y)$. For evaluating a risk functional $r$ the main problem remains determining the answer probabilities $p(y|q, f^0)$ for the test questions of the realized pure state (of nature). There exists a Bayesian and a Frequentist approach to this problem.

In the Bayesian approach model states $f^0$ are inserted as hidden variables. This is the concept we used in this paper, defining a state of knowledge $f$ with probabilities $p(f^0|f)$, and it is called, in the context of decision problems, Bayesian decision theory (Berger, 1985). The answer characteristics $p(y|q, f^0)$ of the possible pure states have to be known. 45 According to the Bayesian paradigm the (training) data dependence $p(f^0|f(D)) = p(f^0|D)$ of the state of knowledge $f = f(D)$ can be written

$$p(f^0|D) = \frac{p(D|f^0)\, p(f^0)}{p(D)} = \frac{p(y^D|q^D, f^0)\, p(f^0)}{p(y^D|q^D)},$$

with $q^D$, $y^D$ the vectors of questions and corresponding answers in the data. A Bayesian expected risk reads



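$$r(f, \tilde f) = \int df^0\, p(f^0|D)\, r(f^0, \tilde f)$$

with

$$r(f^0, \tilde f) = \int dq \int dy \int d\tilde y\; p(q|y_c, \tilde y_c)\, p(y|q, f^0)\, p(\tilde y|q, \tilde f)\, l(q, y, \tilde y),$$

or for an inverse setting

$$r(f^0, \tilde f) = \int dq \int d\tilde q \int dy\; p(q|y_c, \tilde q_c)\, p(y|q, f^0)\, p(\tilde q|y, \tilde f)\, l(q, \tilde q, y).$$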

The posterior probability $p(f^0|D)$ is the only data dependent term. Introduction of model states $f^0$ makes the treatment of the training data independent of the test set. Both cases, test questions not included in the training data and training data not included in the test data, pose no conceptual problems.
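For a finite set of model states the Bayesian expected risk can be written out directly. The sketch below is an illustration under simplifying assumptions that are not in the paper: three Bernoulli model states, a uniform prior, a single repeated question, a deterministic action, and a 0-1 loss; all names and numbers are hypothetical.

```python
# Hypothetical finite set of model states f0: each state is the probability
# theta = p(y=1 | q, f0) of the answer 1 to one repeated question q.
states = [0.2, 0.5, 0.8]
prior = [1.0 / 3.0] * 3            # p(f0), assumed uniform
data = [1, 1, 0, 1, 1, 1]          # made-up training answers y^D

def likelihood(theta, ys):
    """p(y^D | q^D, f0) for independent binary answers."""
    p = 1.0
    for y in ys:
        p *= theta if y == 1 else 1.0 - theta
    return p

def posterior(ys):
    """Bayes' rule: p(f0|D) = p(D|f0) p(f0) / p(D)."""
    joint = [likelihood(t, ys) * p0 for t, p0 in zip(states, prior)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def loss(y, y_action):
    # 0-1 loss between the true answer and the answer of the action.
    return 0.0 if y == y_action else 1.0

def risk_given_state(theta, y_action):
    """r(f0, action) = sum over y of p(y|f0) * loss(y, action)."""
    return theta * loss(1, y_action) + (1.0 - theta) * loss(0, y_action)

def bayes_risk(post, y_action):
    """Expected risk: sum over f0 of p(f0|D) * r(f0, action)."""
    return sum(p * risk_given_state(t, y_action) for t, p in zip(states, post))

post = posterior(data)
risks = {a: bayes_risk(post, a) for a in (0, 1)}
best_action = min(risks, key=risks.get)
```

The data enter only through `post`; the test situation (here the loss and the set of actions) can be changed without touching the posterior computation.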

When $p(y|q, f^0)$ is specified by $p(q|y, f^0)$ and an $f^0$-specific prior $p(y|f^0)$ according to

$$p(y|q, f^0) = \frac{p(q|y, f^0)\, p(y|f^0)}{p(q|f^0)},$$

one has to calculate the probability of the data under $f^0$

$$p(q^D|f^0) = \sum_{y^D} p(q^D|y^D, f^0)\, p(y^D|f^0)$$

(see for example Saul, Jaakkola, & Jordan, 1996) to get $p(f^0|D)$ and integrate over different states $f^0$.



44 If not under direct control, $p(q)$ can be determined separately, or the $q$ can be included in the set of $y$.

45 Under the assumption of the chosen model the Bayesian approach is (defined as being) optimal, but in practice the method depends of course on the correctness of the model for the situation in mind.



7.2 Maximum posterior approximation 

Practical calculations of a Bayesian risk are, if not analytically solvable, in general only possible for a restricted set of $f^0$ and $\tilde f$. Numerical methods using Monte Carlo integration techniques, 46 which make the full integration feasible in some cases, are used in the area of neural networks for example by the Boltzmann machine (Hinton & Sejnowski, 1983, 1986; Ackley, Hinton, & Sejnowski, 1985) and have been applied to Bayesian calculations (see Gelfand & Smith, 1990, Gelfand, Hills, Racine-Poon, Smith, 1990, Geyer, 1992, Besag & Green, 1993, Smith & Roberts, 1993, Tierney, 1994, or Gelman, Carlin, Stern, & Rubin, 1995), including Bayesian analysis of neural networks (Neal, 1993, 1996).

Those methods perform a plug-in estimate of the expected risk. The difference to the Frequentist method discussed later is that the test data are generated according to the a posteriori distribution. One also has to prove that the Monte Carlo estimate converges without having calculated the exact solution. That means that, using a finite sample, we have to assume or calculate nonlocal knowledge about the risk function. In practice we may check the accuracy by repeating the calculation and estimating the variance of the optimal $\tilde f$ using for example cross-validation or the bootstrap. Those are methods of classical statistics, and a plug-in estimate is technically an empirical risk minimization (see Section 8) with $f$-generated virtual examples sampled according to $p(y|q, f)$.

The problems related to the plug-in principle will be 
discussed below for the Frequentist approach. 

Alternatively to Monte Carlo methods, the method of Laplace can be used to approximate the risk integral. This is the real version of the saddle point approximation or method of steepest descent for complex functions (see for example De Bruijn, 1981; Bleistein & Handelsman, 1986). Taylor expansion of $h(x)$ (here $h = -E$, $x = f^0$) around the location $x^*$ of its maximum and performing the resulting integrals gives for a real one-dimensional function $h$

$$\int df^0\, e^{-\beta E(f^0)} = \sqrt{\frac{2\pi}{\beta E^{(2)}(f^{0,*})}}\; e^{-\beta E(f^{0,*})} \left( 1 + \frac{1}{\beta} \left[ \frac{5\, (E^{(3)}(f^{0,*}))^2}{24\, (E^{(2)}(f^{0,*}))^3} - \frac{E^{(4)}(f^{0,*})}{8\, (E^{(2)}(f^{0,*}))^2} \right] + O\!\left(\frac{1}{\beta^2}\right) \right),$$

written for a function with one minimum $E(f^{0,*})$ (or maximum for $-E$), $E^{(2)} > 0$, with $E^{(i)}$ denoting the $i$th derivative at $f^{0,*}$. Interpreting $1/\beta$ as 'temperature', this expansion in $1/\beta$ is a 'low temperature' approximation. In the multidimensional case $E^{(2)}$ in the square root factor, for example, has to be replaced by the determinant of the matrix of second derivatives. For positive quadratic $E$, i.e. Gaussian $e^{-\beta E}$, all terms $E^{(i)}$ with $i > 2$ vanish. Higher order terms are obtained via Wick's theorem 47 and include, in a graphical notation, an exponential of all so called 'linked diagrams' 48 (see for example Negele & Orland, 1988, Itzykson & Drouffe, 1989).

46 Invented in statistical physics and going back to Metropolis, Rosenbluth, Rosenbluth, Teller, Teller (1953) (Metropolis algorithm) and Alder & Wainwright (1959) (molecular dynamics). For applications and developments in physics see for example Hammersley & Handscomb (1964), Binder (1986, 1987, 1995), Binder & Heermann (1988), or the last chapter in Montvay & Münster (1994); for mathematical background on Markov chains see, for example, Seneta, 1981; and for their early use in statistics see Hastings, 1970, Ripley, 1977, Geman & Geman, 1984, Ripley, 1987.
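The quality of the one-dimensional Laplace expansion can be checked numerically on a toy example. The sketch below is an illustration only: the function `E`, the value of `beta`, and the integration range are hypothetical choices, and the derivatives at the minimum were computed by hand for this particular `E`.

```python
import math

def E(x):
    # Hypothetical 'energy' with a single minimum at x = 0.
    return x**2 / 2.0 + x**3 / 6.0 + x**4 / 8.0

# Derivatives of E at the minimum x* = 0 (computed by hand for this E).
E2, E3, E4 = 1.0, 1.0, 3.0

def integral_numeric(beta, lo=-6.0, hi=6.0, n=20000):
    """Trapezoidal rule for the integral of exp(-beta * E(x))."""
    step = (hi - lo) / n
    total = 0.5 * (math.exp(-beta * E(lo)) + math.exp(-beta * E(hi)))
    for k in range(1, n):
        total += math.exp(-beta * E(lo + k * step))
    return total * step

def laplace_leading(beta):
    """Gaussian (leading order) term of the expansion."""
    return math.sqrt(2.0 * math.pi / (beta * E2)) * math.exp(-beta * E(0.0))

def laplace_first_order(beta):
    """Leading term times the 1/beta correction factor."""
    corr = (5.0 * E3**2 / (24.0 * E2**3) - E4 / (8.0 * E2**2)) / beta
    return laplace_leading(beta) * (1.0 + corr)

beta = 20.0
exact = integral_numeric(beta)
err_leading = abs(laplace_leading(beta) - exact) / exact
err_first = abs(laplace_first_order(beta) - exact) / exact
```

For this example the relative error of the leading term is of order $1/\beta$, and including the $1/\beta$ correction reduces it further, in line with the 'low temperature' interpretation.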

Generalizations of the saddle point formula for multiple extrema exist which take care of overlapping parts belonging to sufficiently close extrema (Berry, 1966, Miller, 1970, and Connor & Marcus, 1971). However, one usually assumes the extrema to be well separated, and in our case, being less interested in the actual value of the Bayesian risk than in finding the best $\tilde f$, only effects varying between different $\tilde f$ would be important.

To find the expansion point $f^{0,*}$ one has to solve the stationarity conditions

$$\frac{\partial}{\partial f^0} E(f^0) = 0, \qquad (19)$$



which are nonlinear for non-quadratic $E$, and then have to be solved iteratively to find a self-consistent solution. In a multidimensional case a self-consistent solution $f^{0,*} = \{f^{0,*}_x | x \in X\}$ is only influenced by the values of $f^{0,*}_{x'}$ but by no other $f^0 \in F^0$. Thus, $f^{0,*}$ has to incorporate approximately the combined effects of all other $f^0 \in F^0$, so sometimes $f^{0,*}$ is also called a mean field solution. Analogously one sometimes refers to the stationarity condition (19) as mean field equation and to the saddle point approximation as mean field approach.

According to Eq.(18) one might apply the saddle point approximation to the whole $f^0$-dependent integrand $p(f^0|D)\, r(f^0, \tilde f)$ or, $y$- and $q$-dependent, to $p(f^0|D)\, p(y|q, f^0)$. Both variants are usually clearly too complicated, leading for example to an $\tilde f$-dependent factor depending on the second derivative. 49 But having a large amount of data or strong nonlocal dependencies it is often reasonable to assume the posterior probability $p(f^0|D)$ to be peaked sharply around one maximum. More formally, we may identify $\beta$ with the number $n$ of training data, and assume that the sample mean


47 Which is a systematic way to calculate (multidimensional) Gaussian integrals over polynomials. Those arise when expanding the remaining exponential factor.

48 The difference between 'linked' and 'unlinked' diagrams is similar to that between moments and cumulants generated by $\langle e \rangle$ or $\ln \langle e \rangle$, respectively, in a high temperature expansion (see Section 5.3.2). The general relation between expanding a sum of exponentials and its logarithm is also known under the name 'linked cluster theorem'.

49 On the other hand the $\tilde f$-dependency is an interesting feature if $r(f^0, \tilde f)$ can be included, because then the saddle point approximation can be adapted to $\tilde f$. It would lead to coupled maximization and minimization problems. (In contrast we will discuss below a maximization problem which is independent of the subsequent minimization problem.) This dependency of the maximization problem on the corresponding minimization problem, i.e. its adaptation to $\tilde f$, suggests for example an iterative procedure where both steps are performed alternately.



$\frac{1}{n} \sum_i \ln p(y_i|q_i, f^0)$ of the variable $z(y, q) = \ln p(y|q, f^0)$ becomes for large $n$ a nearly $n$-independent function. In this case a large $n$ (many data) corresponds to a low temperature $1/\beta$. 50 Then the second order Taylor expansion of the log-posterior $\ln p(y|q, f^0)$ (i.e. a Gaussian approximation for the posterior) can be a good approximation. For example, under some regularity conditions (e.g. the number of parameters included in $f^0$ is not chosen to increase with the sample size, the limit is not at the edge of the parameter space $F^0$, and different parameter values $f^0$ do not correspond to identical probabilities $p(y|q, f^0)$ at the maximum) this will be the case for i.i.d. random variables $z(y, q)$ with finite variance, according to the general asymptotic Gaussian limit theorem (Le Cam, 1953, 1986, Le Cam & Yang, 1990, see also the discussion and references in Chapter 4 and Appendix B of Gelman, Carlin, Stern, Rubin, 1995). Then the posterior will have a variance $(nJ(f^0))^{-1}$, where $J$ is the expectation under the true state of (the matrix of) the negative second derivatives of the log-posterior (Fisher information). For dependent variables this is not necessarily true, but can also be the case. Then, if $r(f^0, \tilde f)$ varies only weakly with $f^0$ compared with $p(f^0|D)$ (Gaussian $p(f^0|D)$ alone is not enough in this case), it does not strongly influence the location of the maximum. Then we can identify $\beta$ with $n$, and $E(f^0)$ with $-\frac{1}{n} \left( \sum_i \ln p(y_i|q_i, f^0) + \ln p(f^0) \right)$. (The factor $n$ disappears in the stationarity conditions for $h(x)$ if those are multiplied by $n$.) We do, however, not restrict ourselves to cases where we interpret $\beta = n$. Having nonlocal prior terms (interactions) in the exponent, the saddle point approximation can be a good approximation even for a small number $n$ of local data, if the dependencies induced by the prior (interaction) terms restrict the number of functions with high probability strongly enough. Indeed, from physics it is known that mean field theories (saddle point approximations) can become exact when the correlations are strong, like for long-range forces or for local forces in high dimensional spaces. (For many physical models with local interactions, like the Ising model, $d > 4$ is the dimension above which the mean field theory is valid.) In the case where only part of the exponent is multiplied by $\beta$, i.e. the integrand has the form $e^{\beta h(x) + \ln r(x)}$, we can apply the slightly more general formula



$$\int df^0\, r(f^0)\, e^{-\beta E(f^0)} \approx r(f^{0,*}) \sqrt{\frac{2\pi}{\beta E^{(2)}(f^{0,*})}}\; e^{-\beta E(f^{0,*})},$$



with $r(f^0)$ corresponding to $r(f^0, \tilde f)$, $-\beta E(f^0)$ to $L(f^0, f)$, and $f^{0,*}$ the location of the minimum of $E(f^0)$ (maximum of $L(f^0, f)$), i.e. independent of $r(f^0)$. This is called maximum posterior approximation (MaP). Especially for high dimensional spaces $F^0$ the off-peak contributions can be large, requiring a large amount of data to allow a MaP approximation. Note also that in the context of decision theory we only need to evaluate the Bayesian risk to select an optimal $\tilde f$ and we do not have to require a good approximation of the risk itself; $\tilde f$-independent factors are therefore not important, like the ones including the second derivative $h^{(2)}$ or the factor $p(y^D|q^D)$.

50 Similarly, in field theories (e.g. quantum theory) in a Euclidean (imaginary time) formulation the system size is related to $n$, while in the corresponding interpretation as a classical statistical system the parameter is an inverse temperature $\beta$, and the evolution operator for imaginary times appears as 'transfer matrix'. In particular, the large $\beta$ limit corresponds to the limit of large system size. (See for example Zinn-Justin, 1989.)

Hence, to apply a $q$-, $y$-independent MaP approximation with respect to $f^0$ we write the $f^0$-dependent but $q$-, $y$-independent probabilities in exponential form

$$p(y_i^D|q_i^D, f^0) = e^{L^D(y_i^D|q_i^D, f^0)}, \qquad p(f^0|f) = e^{L^0(f^0)},$$

defining the log-probabilities $L^D$ (log-likelihood), $L^0$ (log-prior), and get

$$r(f, \tilde f) \propto \int df^0\, e^{\sum_i L^D(y_i^D|q_i^D, f^0) + L^0(f^0)}\, r(f^0, \tilde f).$$



If we include the log-prior $\ln p(f^0)$ into $h(x)$, it is also included in the determination of the maximum $h(x^*)$. This allows us to discuss situations where the $n$ data are not enough to yield sharp, especially non-degenerate, maxima (and indeed, choosing the relevant questions as basis questions, $X = Q^R$, the prior is always essential for local data if $n < |X|$). In contrast, we assumed $r(f^0, \tilde f)$ not to be important for the location of the maximum. An extreme example of the opposite would be if the risk for the most probable state $f^{0,*}$ were infinite for all $\tilde f$, meaning that $f^{0,*}$ can be excluded from $F^0$. Less extreme, a risk can be strongly peaked at an $f^0$ with low posterior for some $\tilde f$. On the other hand, it only matters which $\tilde f$ is finally selected as the optimal one. This implies robustness against all errors, made somewhere on the way, which do not change this final decision.

We summarize that the MaP approximation includes 
probability aspects, but not aspects of relevance related 
to the loss function. 

Maxima may be degenerate or weakly peaked within a subspace of parameters of $F^0$. Then one may perform a partial saddle point approximation for the subspace where the necessary conditions are fulfilled. For example, the risk may measure aspects of (i.e. depend on parameters describing) $f^0$ which are not measured by the data, so $p(f^0|D)$ does not depend on them. Then the locations of the maxima depend on the risk $r(f^0, \tilde f)$, and the maxima are necessarily degenerate in the direction of those relevant but not measured parameters (e.g. in a model with $X = Q^R$, uniform prior, and only local data). Then the MaP step returns not a single point but a subspace of important possibilities. More generally, the maximum can, after incorporating data and prior, still be rather flat in some of the relevant dimensions. Then a MaP approximation should only replace a part of the $f^0$-integration, while another part, i.e. an integration over a subspace of parameters of $f^0$, remains. Performing this integration gives a new effective risk sensitive to the available data.



The MaP approximation consists in finding the most probable state $f^{0,*}$ to approximately calculate the $f^0$ integral. Then the factor $e^{\sum_i L^D(y_i^D|q_i^D, f^{0,*}) + L^0(f^{0,*})}$, being independent of $\tilde f$, can be skipped and the optimal action state $\tilde f^*$ can be found by minimizing $r(f^{0,*}, \tilde f)$. Often $\min_{\tilde f} r(f^0, \tilde f)$ is a constant over all $f^0$, like in the usual regression case with mean square error, deterministic unrestricted $\tilde f$, and Gaussian $f^0$ with $f^0$-independent variance. Thus, the full approximation procedure consists of two steps (MaP-MiR)

1. Maximization of the posterior (MaP):

$$f^{0,*} = \operatorname{argmax}_{f^0} \left( \sum_i L^D(y_i^D|q_i^D, f^0) + L^0(f^0) \right),$$

2. Minimization of risk (MiR) (for the state $f^{0,*}$ with maximal posterior):

$$\tilde f^* = \operatorname{argmin}_{\tilde f}\, r(f^{0,*}, \tilde f).$$
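The two steps can be sketched for a toy problem in which the loss is not a pure approximation error, so that the MiR step selects an action different from the MaP state itself. Everything specific here is an assumption made for illustration: a Gaussian likelihood with mean `mu` as model state, a Gaussian log-prior, a grid of candidates, and an asymmetric linear loss.

```python
import math

data = [1.2, 0.7, 1.9, 1.1, 1.4]             # made-up training answers y_i
sigma, sigma0 = 1.0, 2.0                     # assumed noise and prior widths
grid = [i / 20.0 for i in range(-60, 61)]    # candidates for mu and for the action

def log_lik(mu):
    """Sum of log-likelihood terms L^D (up to a mu-independent constant)."""
    return sum(-0.5 * ((y - mu) / sigma) ** 2 for y in data)

def log_prior(mu):
    """Log-prior L^0 (up to a constant)."""
    return -0.5 * (mu / sigma0) ** 2

# Step 1 (MaP): maximize log-likelihood plus log-prior; no loss involved.
mu_map = max(grid, key=lambda mu: log_lik(mu) + log_prior(mu))

def loss(y, a, c_under=3.0, c_over=1.0):
    # Asymmetric loss: predicting too low costs three times as much.
    return c_under * (y - a) if y > a else c_over * (a - y)

def risk(mu, a, half_width=8.0, n=1600):
    """r(f0, action): expected loss under y ~ N(mu, sigma^2), midpoint rule."""
    step = 2.0 * half_width / n
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n):
        y = mu - half_width + (k + 0.5) * step
        pdf = math.exp(-0.5 * ((y - mu) / sigma) ** 2) / norm
        total += pdf * loss(y, a) * step
    return total

# Step 2 (MiR): minimize the risk for the single state found in step 1.
a_star = min(grid, key=lambda a: risk(mu_map, a))
```

With this loss the optimal action lies near the 0.75 quantile of the predictive density of the MaP state, i.e. above `mu_map`; a different loss could reuse the same `mu_map` unchanged.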

Note that in this approximation the MaP step is performed independently of the aspects important for the MiR step. 51 The first step can be interpreted as finding an approximation independent of its application. The second step uses the best approximation found for a specific application situation defined by the loss function. This is the usual implicit setting when looking for approximations without specifying the applications for which they will be used. It has the advantage that the independence allows the same approximation to be used for several different applications. Thus, in the everyday use of this procedure a statistician performs the first approximation step and potential users of the approximation perform the second, adapted to their problem. An example of a MaP-MiR related algorithm can be found in (Lemm, Beiu, Taylor, 1995). There the MaP step is implemented as a density approximation. A subsequent constructive algorithm tries to find a solution that is easy to implement in hardware, working not directly with the data but with the results of the density estimation of the first step. This is an attempt to minimize also aspects of the loss not related to approximation and corresponds to the MiR step.

The loss function depends on the action state $\tilde f$, which may include aspects like the complexity of $\tilde f$. But note that the loss measures no aspects like the complexity of the algorithm used to find the optimal (or a good) $\tilde f$. So to say, action loss is included but no algorithmic loss. As a two-step procedure can often be expected to be more complex than a one-step procedure, the MaP-MiR procedure seems to be more appropriate for situations where aspects of the loss related to $\tilde f$, for example its approximation ability and complexity, are more important than aspects of the loss related to the resource requirements of the algorithm. But as stated in Section 6 one can also consider algorithms as part of $\tilde f$ in a higher level problem. Then one can include specific algorithmic aspects of the loss and look for an optimal algorithm for a certain distribution of application (learning) situations. Again, to solve for the best algorithm one has to use a meta-algorithm, and the same kind of problem appears on this higher level, as here meta-algorithmic loss aspects are not included. Going further, meta-algorithms could be included into $\tilde f$ using meta-meta-algorithms and so on, but usually the complexity increases so much from one level to the next that such applications are not expected to be feasible in most practical cases.

51 In statistical practice or biological reality where on-line learning is required (and the model spaces $F^0$ and $\tilde F$ may be adapted) both steps can of course be performed interlaced. See for example the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995; Hinton, Dayan, Frey, Neal, 1995), where in the 'learning phase' (MaP step) a 'generative model' (state $f^0$) is adapted and in the 'dreaming' phase (MiR step) the 'recognition model' (action $\tilde f$) is optimized for a given generative model $f^0$.

There are cases where $p(y|q, f^0)$ depends on a huge number of internal ('hidden') integration variables $z$. Then an approximated log-posterior must be maximized. Exchanging a nonlinear function with the integration is called annealed approximation (Seung, 1995). The replica approach is a special adaptation of a saddle point method for the case when the logarithm of a sum, $g_i = \ln \sum_j e^{L_{ij}}$, has to be averaged with weights $p_i$ (Mézard, Parisi, & Virasoro, 1987). This situation can occur, for example, if algorithms are compared with respect to a large ($n \to \infty$) number of application situations where the average is over the sampled data.

The second (MiR) step uses $p(y, q|f^0)$ to find the best alternative $\tilde f$, i.e. for 'training' of the action model. A numerical evaluation of the integral can be seen as an empirical risk minimization (see Section 8) on a set of ($f^0$-generated) virtual examples sampled according to $p(y, q|f^0)$, just as a numerical evaluation of a full Bayesian approach can be based on $f$-based virtual examples generated according to $p(y, q|f)$.
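Training the action model on virtual examples can be sketched as follows. The generating model (a noisy linear relation standing in for the MaP state), the one-parameter family of deterministic actions, and the squared loss are hypothetical choices made only for this illustration.

```python
import random

def sample_virtual_examples(n, slope=2.0, noise=0.5, seed=0):
    """Draw (q, y) pairs from a hypothetical model state:
    q ~ Uniform(0, 1), y = slope * q + Gaussian noise."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        q = rng.random()
        examples.append((q, slope * q + rng.gauss(0.0, noise)))
    return examples

def empirical_risk(a, examples):
    """Mean squared loss of the deterministic action y_action(q) = a * q."""
    return sum((y - a * q) ** 2 for q, y in examples) / len(examples)

examples = sample_virtual_examples(2000)
candidates = [i / 20.0 for i in range(0, 81)]     # action parameter a in [0, 4]
a_best = min(candidates, key=lambda a: empirical_risk(a, examples))
```

Because the examples are generated by the model rather than measured, their number is limited only by computing resources, not by the amount of training data.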

Note that the distinction between a parallel and an inverse decision problem, corresponding to choosing $p(\tilde y|q, \tilde f)$ or $p(\tilde q|y, \tilde f)$ to define the action model, only matters for the MiR and not for the MaP step. Also, the model need not necessarily be reduced to only one remaining state. Thus, MaP-MiR can be generalized with a first step reducing the space $F^0$ to a smaller subset and a second step minimizing the risk within that subset. But using more than one state for the minimization step, like taking for example the $n$ most probable states or skipping only very unlikely ones, requires calculation of the relative weights of the states, related to second derivatives with respect to the parameters of $f^0$. Every decision problem defines an optimality mapping by $\tilde f(f^0) = \operatorname{argmin}_{\tilde f \in \tilde F}\, r(f^0, \tilde f)$ for all $f^0 \in F^0$, and analogously for $f \in F$. In principle, one can restrict the search space $\tilde F$ by eliminating all $\tilde f$ which are never optimal. In the case that available data have equal probability within $f^0 \in [f^0]_{\tilde f^*}$, i.e. they do not distinguish between states $f^0$ leading to the same decision $\tilde f$, identifying those makes the optimality mapping one-to-one. With respect to such a construction the space $F^0$, and therefore also the MaP step as maximization over $f^0 \in F^0$, is not independent of the loss function.



7.2.1 Perturbation theory beyond MaP 

The MaP approximation can be extended by expanding the exponential to higher orders around the Gaussian reference point. We already mentioned that higher order contributions can be obtained by including all linked diagrams according to Wick's theorem. In general, a perturbation theory can be built upon any reference point based upon the general formula (see the section about Heim's perturbation theory in Jaynes, 1996)

$$e^{A + \epsilon B} = e^{A} \left[ 1 + \sum_{n=1}^{\infty} \epsilon^n \int_0^1 dx_1 \int_0^{x_1} dx_2 \cdots \int_0^{x_{n-1}} dx_n\; B(x_1) B(x_2) \cdots B(x_n) \right],$$

with

$$B(x_n) = e^{-x_n A}\, B\, e^{x_n A}.$$

Here $A$, $B$ stand for matrices, $x$ for a real number, and $xA$ means multiplication of each entry of $A$ by $x$. In matrix notation we have

$$r(f, \tilde f) = \langle l(q, y, \tilde f) \rangle = \mathrm{Tr}\left( p(q, y, f^0, f)\, l(q, y, \tilde f) \right) = \mathrm{Tr}\left( e^{L(q, y, f^0, f)}\, l(q, y, \tilde f) \right),$$

with $L(q, y, f^0, f) = \ln\left( p(f^0|f)\, p(q)\, p(y|q, f^0) \right)$, the trace denoting the integrals $\mathrm{Tr} = \int df^0 \int dq \int dy$ for diagonal matrices $L(f)_{ij}$, $l(\tilde f)_{ij}$ with indices $i = (q, y, f^0)$, $j = (q', y', f^{0\prime})$, and $p(q, y, f^0, f) = e^{L(q, y, f^0, f)}$. Because the trace is invariant under similarity transformations $S$, this formulation allows, if convenient, to work with nondiagonal $L' = S L S^{-1}$, $l' = S\, l\, S^{-1}$. 52

If we now write $L = A + \epsilon B$, the expected risk can be expressed completely by unperturbed expectations $\langle \cdots \rangle_A$ with respect to the reference $e^A$,

$$\langle l \rangle - \langle l \rangle_A = \sum_{n=1}^{\infty} \epsilon^n \left( \langle Q_n l \rangle_A - \langle Q_n \rangle_A \langle l \rangle_A \right),$$

with

$$Q_1 = S_1; \quad Q_n = S_n - \sum_{k=1}^{n-1} S_k \langle Q_{n-k} \rangle_A, \quad n > 1,$$

where $S_n$ denotes the $n$-fold integral multiplying $\epsilon^n$ in the expansion of $e^{A + \epsilon B}$ above.
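The first-order term of the operator expansion for $e^{A + \epsilon B}$ can be verified numerically for small matrices. The following sketch uses only the standard library; the $2 \times 2$ matrices `A` and `B`, the value of `eps`, and the helper functions (a Taylor-series matrix exponential and a midpoint rule) are hypothetical choices for illustration, not part of the paper.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(X, Y, c=1.0):
    """Return X + c * Y."""
    return [[X[i][j] + c * Y[i][j] for j in range(2)] for i in range(2)]

def scale(X, c):
    return [[c * X[i][j] for j in range(2)] for i in range(2)]

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

def expm(X, terms=40):
    """Matrix exponential by its Taylor series (adequate for small norms)."""
    result, term = IDENTITY, IDENTITY
    for n in range(1, terms):
        term = scale(matmul(term, X), 1.0 / n)
        result = add(result, term)
    return result

def max_abs_diff(X, Y):
    return max(abs(X[i][j] - Y[i][j]) for i in range(2) for j in range(2))

A = [[0.0, 0.5], [0.5, 0.0]]      # hypothetical reference exponent
B = [[1.0, 0.0], [0.0, -1.0]]     # hypothetical perturbation (does not commute with A)
eps = 0.01

def B_of_x(x):
    """B(x) = exp(-x A) B exp(x A)."""
    return matmul(matmul(expm(scale(A, -x)), B), expm(scale(A, x)))

def first_order_integral(n=200):
    """Midpoint rule for the integral of B(x) over x in [0, 1]."""
    S1 = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n):
        S1 = add(S1, B_of_x((k + 0.5) / n), 1.0 / n)
    return S1

exact = expm(add(A, B, eps))
zeroth = expm(A)
first = matmul(zeroth, add(IDENTITY, first_order_integral(), eps))
err_zeroth = max_abs_diff(exact, zeroth)
err_first = max_abs_diff(exact, first)
```

The error of the reference alone is of order `eps`, while including the first-order term leaves an error of order `eps**2`, as expected from the next term of the series.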

8 The Frequentist approach 

8.1 Empirical risk minimization 

The Frequentist paradigm is related to the general plug-in or bootstrap principle. 53 It is also called Monte Carlo estimate if applied to calculate expectations (Efron & Tibshirani, 1993). It is based on results of the
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See for example Derka, Buzek, Adam, & Knight (1996) 
for Bayesian inference with density operators used in quan- 
tum theory. 

Often the term bootstrap refers to the special case of 
estimating the standard error of some sample estimate which 
then requires resampling. 



theory of uniform convergence. A functional of a pop- 
ulation distribution is estimated by applying the same 
functional to an empirical sample drawn according to 
that distribution. 54 In our case the risk functional r 
is replaced by an empirical risk f: Training questions 
q D are generated according to p(q D \yf,yf) (or usually 
p{q D )) and the risk functional r is applied to the empir- 
ical distribution of 

In the case the risk functional is chosen as the expectation and $p(q|y_c, z_c) = p(q|y_c)$ we define an integrated loss function

$$\int dz\, p(z|q, y, \hat f)\, l(q, y, z) = \tilde l(q, y, \hat f).$$



Specifically, in parallel decision problems z is equal to y 
and in inverse problems equal to q. Note that in I(q, y, /) 
/ just denotes parameters. Yet, for a given loss func- 
tion we can always introduce some effective determin- 
istic function f(q) (or f(y)) (not uniquely defined and 
possibly vector valued) containing the dependence of / 
from the parameterization of / and (part of the depen- 
dence from) q (or y) according to I(q, y, /) = I(q, y, f(q)) 
or l(q,y,f) = l(q,y, f(y))- That means every decision 
problem with p(q\y c , z c ) — p(q\y c ) can be seen as equiva- 
lent to a decision problem with deterministic function /. 
For a parallel decision problem with a deterministic y- 
producing device we choose the functional dependency 
of / from its first argument q to be equal to that of / from 
its first argument q, which gives / = / and f(q) — f(q). 
Alternatively, we can simplify the notation by absorb- 
ing the first q into the definition of f(q) and write in 
such cases /(y, /(<?)). Analogously, in an inverse setting 
we define f(y) and can choose in the deterministic case 

f(y) = fry). 

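The expected risk

$$r(f^0, \hat f) = \int dq \int dy\, p(q, y|f^0)\, \tilde l(q, y, \hat f),$$

with $p(q, y|f^0) = p(q|y_c)\, p(y|q, f^0)$ is approximated by

$$\hat r(D(N, f^0), \hat f) = \frac{1}{N} \sum_{i=1}^{N} \tilde l(q_i, y_i, \hat f),$$

with the data sampled according to $p(q, y|f^0)$.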

The plug-in principle assumes the distribution of 
training data D including the distribution of questions 
to be identical to that of the test data D l including p{q l ) 
for the relevant q l for which we want to calculate the 
functional r. This holds for example in a setting where 
both training and test set are generated by the same 
(stationary) device. Then an explicit knowledge of the 
generating distribution is not necessary. 
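As a minimal sketch of the plug-in estimate, the expected risk of a fixed candidate function can be approximated by the sample mean of its loss. The toy device ($y = 2q$ plus Gaussian noise), the uniform distribution of questions, the quadratic loss, and all function names below are illustrative assumptions and are not taken from the text.

```python
import random

def empirical_risk(loss, f_hat, data):
    """Plug-in estimate: mean loss of f_hat over sampled (q, y) pairs."""
    return sum(loss(q, y, f_hat) for q, y in data) / len(data)

def quadratic_loss(q, y, f_hat):
    return (y - f_hat(q)) ** 2

def sample_data(n, rng):
    # Hypothetical stationary device: q ~ U(0, 1), y = 2 q + Gaussian noise.
    data = []
    for _ in range(n):
        q = rng.random()
        data.append((q, 2.0 * q + rng.gauss(0.0, 0.1)))
    return data

rng = random.Random(0)
train = sample_data(2000, rng)
f_good = lambda q: 2.0 * q   # coincides with the regression function
f_bad = lambda q: 0.0        # ignores the question
risk_good = empirical_risk(quadratic_loss, f_good, train)
risk_bad = empirical_risk(quadratic_loss, f_bad, train)
```

Both candidates are fixed before the data are seen, so each sample mean estimates the corresponding expected risk. This no longer holds once the candidate is selected by minimizing the empirical risk on the same data.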

54 As empirical distributions of finite samples are a quite restricted class of distributions, functionals equal for them can differ on general distributions.

Because the empirical risk values are used to make the decision, the chosen $\hat f$ depends on the training data and the empirical risk of the chosen $\hat f$ does not approximate its expected risk. To estimate the expected risk one has to reevaluate the risk for the chosen $\hat f$ with new, independent sample data, the empirical test set, not involved in the decision. (Not to be confused with the (true) test set $D^l$ in the definition of the decision problem.) Bounds for the difference between the expected risk of the true optimal $\hat f$ and the chosen $\hat f$ are given by the theory of uniform convergence (Vapnik, 1982; Dudley, 1984; Pollard, 1984; Haussler, 1995; for an introduction see e.g. Kearns & Vazirani, 1994). These bounds are based on what we called structural information. Their local part, like bounds for absolute values or the local variance, allows locally the application of probability theoretic inequalities like Hoeffding's or Chebyshev's inequality. Their nonlocal part, formulated for example as finite $\epsilon$-entropy, or finite pseudo or VC dimension of a set $l(q, y, \hat f)$ for $\hat f \in F$, allows generalization. Then, when the training data are i.i.d. sampled according to the test distribution, the theory of uniform convergence gives bounds on the probability of deviations of the empirical risk from the
expected risk in the true state /°. For example, (Vapnik, 
1995) gives bounds <$(/, TV) in terms of the VC dimension 
for 

$$p_{VC} = \sup_{f^0}\, p\Big(\sup_{\hat f}\, \big|\hat r(D(N, f^0), \hat f) - r(f^0, \hat f)\big| > \epsilon \,\Big|\, f^0\Big),$$

for bounded risk. The supremum over true states $f^0$ (with $p(f^0|f) \neq 0$) is implicit in the results using the definition of the VC dimension and can be replaced by $\forall f^0$. Using worst case considerations there is no need in this theory to explicitly calculate posterior probabilities.
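To illustrate how a local inequality and a nonlocal restriction combine, the following sketch evaluates the simplest uniform bound: Hoeffding's inequality for a loss with values in $[0, 1]$, extended by a union bound to a finite set of candidate functions. The finite set stands in for the VC dimension or $\epsilon$-entropy discussed in the text and is an assumption made for this example, as are the function names.

```python
import math

def hoeffding_union_bound(n_samples, epsilon, n_functions=1):
    """Upper bound on P(sup_f |empirical risk - expected risk| > epsilon)
    for a finite set of n_functions candidates and a loss in [0, 1]."""
    bound = 2.0 * n_functions * math.exp(-2.0 * n_samples * epsilon ** 2)
    return min(1.0, bound)

def samples_needed(epsilon, delta, n_functions=1):
    """Smallest N for which the bound above is at most delta."""
    return math.ceil(math.log(2.0 * n_functions / delta) / (2.0 * epsilon ** 2))

# Hypothetical setting: 1000 candidate functions, deviation 0.1, confidence 95%.
n_required = samples_needed(0.1, 0.05, n_functions=1000)
```

For small samples the bound saturates at 1 and carries no information, which is the sense in which uniform bounds become trivial when only few local data are available.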

There are some recent studies of how specific nonlocal information affects the bounds of uniform convergence (Abu-Mostafa, 1990, 1993a, 1993b; Ratsaby & Maiorov, 1996). But for general nonlocal information a reformulation in terms of VC dimension or $\epsilon$-entropy is often difficult or even practically impossible, and one has to use an upper bound for them, giving the results of the theory of uniform convergence another worst case interpretation. Then, especially when only few local data are available, the uniform bounds can be trivial or weak.

8.2 Vocabulary and framework 

We formulate the Frequentist setting in a decision theoretic language. Assume the availability of a stationary sampling process $S$ to generate training questions $q \in Q^D$ according to some $p^S(q)$ for which answers are available and the ratio with the distribution of relevant questions $p(q)/p^S(q)$ is known. We call the set $Q^S$ of questions with $p^S(q) \neq 0$ the sampling population (for questions) and specify for the present context $Q^D$ to be the set of sampled or training questions with answers used for the plug-in estimate. Including previously defined sets of questions we have the following listing

1. $Q^S$ the sampling population with $q \in Q^S$ the sampling questions,

2. $Q^D$ the set of sampled or training questions,

3. $Q^0$ the set of prior questions being the questions with data available but with $q \in Q^0$ not sampled according to $S$ or not used for the plug-in estimate,

4. $Q^l$ the set of relevant questions,

5. $Q^c = Q^l \setminus Q^S$ the set of cost questions.

Data $(q_i, y_i)$ are obtained by using measurement devices for $q_i$ to find results $y_i$. With respect to a sampling process $S$ we separate the available data into the two groups:

1. Sampled data or training examples $D^S$ (we will write more shortly just $D$, skipping the superscript $S$), which are obtained using the available stationary sampling process $S$ and used to calculate the empirical risk via the plug-in principle. Nonlocal questions can be included in $Q^S$.

2. Prior data $D^0$, being all other data with questions generated from other processes $S' \neq S$. Such processes can be unknown processes from the past possibly different from $S$, they can be non-stationary like for active queries, they can use devices measuring other questions or represent active control. All priors $p(f^0|f)$ can be related to a factorial prior $p_{fact}(f^0) = \prod_x p(f^0(x))$ by data $D^0$ with $p(f^0|f(D^0)) \propto p_{fact}(f^0)\, p(D^0|f^0)$. The data $D^0$ are not uniquely defined and the corresponding questions need not necessarily be in the sampling population $Q^S$. Data which are sampled according to $S$ but not used for the plug-in principle are not sampled data but prior data.

For a Bayesian treatment the distinction is not impor- 
tant, but in a Frequentist approach only sampled data 
are used for the plug-in principle while prior data only 
enter in form of restrictions of the space F. Note that 
this use of the term prior does not refer to the temporal 
aspect meaning information collected previously to the 
data D. We made things simpler by not trying to distin- 
guish non-sampled non-prior data from priors. Thus, 
identification of the reference factorial prior as well as 
assumed and not measured data are understood as being part of the prior data $D^0$. In another context it
might well be convenient, but not necessary, to distin- 
guish non-sampled data from prior data with the latter 
having a more temporal connotation or referring to as- 
sumed and not measured data. 

According to this distinction of data we can also split the log-posterior into a sampled and prior part

$$L(D, D^0, f^0) = L^D(D, f^0) + L^0(D^0, f^0) = \sum_i L^D(y_i|q_i, f^0) + L^0(f^0).$$



Introducing a dummy index one can always achieve $Q^0 \cap Q^S = \emptyset$ even if for a question sampled and prior data are available at the same time. Non-relevant sampled questions $q \notin Q^l$ can be excluded from $Q^S$ because for them $p(q) = 0$. If a sampling process nevertheless produces them they can, instead of simply being eliminated, technically be treated like prior data influencing $p(f^0)$. The property of being sampled is for them not important and we choose in the following $Q^S \subseteq Q^l$. Including nonlocal questions like smoothness into $Q^S$ instead of nonlocal prior data could also enable generalization. But this requires nonlocal questions also to be relevant, i.e. in $Q^l$, which is usually not the case. Also we prefer to treat the $q \in Q^l$ as being independent without priors.

On the other hand we do not assume all relevant questions necessarily to be available for sampling. That allows $Q^l \supset Q^S$ which represents situations where only part of the expected risk can be estimated by sampling. We call the remaining part costs. Complexity costs are a typical example. The costs must be determined by other information which could come from another sampling process, a Bayesian calculation, or, if $Q^c$ is finite and deterministic, from a complete set of answers. Fig. 8 shows the relations graphically.

We define prior data for question $q$ to be in the set of

a. Hints $D^0_h$ if questions from $Q^S$ depend on them, that is if $X_q \cap X_{Q^S} = X_S \neq \emptyset$. (Here $X_q$ denotes the basis of prior question $q$.)

b. Cost priors $D^0_c$ if cost questions from $Q^c$ depend on them, that is if $X_q \cap X_{Q^c} = X_c \neq \emptyset$.

A specific prior can be both and therefore the two sets need not be disjoint.

Splitting the $q$-integrations of the expected risk into a sampled risk $r^D$ with $q \in Q^S$, determined by answers to $q \in Q^D \subset Q^S$ sampled by $S$, and an additional (non-sampled) cost term $r^0$ for $q \in Q^c$, determined by $D^0$, gives

$$r(f, \hat f) = r^D(f(D, D^0), \hat f) + r^0(f(D, D^0), \hat f),$$

with $r^D(f, \hat f)$ being

$$\int_{Q^S} dq \int dy \int df^0\, p(f^0|D, D^0)\, p(q)\, p(y|q, f^0)\, l(q, y, \hat f),$$

and analogous for $r^0(f, \hat f)$ with $\int_{Q^c} dq$. We write $l^D$ for the part of the loss function depending on $q \in Q^S$, that means $l^D(q, y, \hat f) = l(q, y, \hat f)$ for $q \in Q^S$ and zero otherwise, i.e. if $q \in Q^c$. Analogously, we write $l^0$ for the part depending on $q \in Q^c$. The parts are only defined up to $q$-independent terms, because such terms can be shifted between questions without changing the risk. We can use this freedom for a convenient choice.

Cases in which available data for questions in the loss function are partly sampled as well as not sampled are included in this definition by duplicating questions in $Q^S$ with a dummy index,

$$l(q, y, \hat f) \to l^D(q, y, \hat f) + l^0(q', y', \hat f),$$

if we define $q' \in Q^c$ for $q \in Q^S$.

The cost term can depend in general on all data including the training examples $D$. In such cases one may call $r^0(f, \hat f)$ posterior (non-sampled) costs or data dependent (non-sampled) costs as they have to be determined after having seen all the data. If $p(f^0|D, D^0) = p(f^0_{Q^S}|D, D^0_h)\, p(f^0_{Q^c}|D^0_c)$ with $D^0_h \cap D^0_c = \emptyset$ then the $f^0_{Q^S}$-integration vanishes. Understanding $f^0 = f^0_{Q^S}$ to be the restriction on $Q^S$ we can write

$$r(f, \hat f) = r^D(f, \hat f) + r^0(\hat f),$$

and call $r^0(\hat f) = l^0(\hat f)$ prior costs. Being $q$-independent, $l^0$ can also be seen as part of $l^D$.

Examples for possible (prior) costs include storage
requirements for parameters of $\hat f$ like the number of
weights of a neural network or number of nodes in a 
decision tree, penalties for on-line evaluation times, cri- 
teria related to understandability for human experts, or 
to an effective and cheap hardware implementation in 
VLSI technology. 

8.3 Sampling generalized questions 
Reweighting

Here we consider the case of relevant questions $q \in Q^l$ not directly sampled, i.e. $q \notin Q^S$ or equivalently $q \in Q^c$. We show that for such questions which have a basis $X_q$ of $q$ completely within $Q^S$ theoretically, but often not practically, the sampling process $S$ can be extended to an $S'$ so that $q \in Q^{S'}$. The basic fact used in reweighting for evaluating an integral by the plug-in principle is that the factorization of the integrand into function and probability is not unique,

$$\int dz\, p(z)\, g(z) = \int dz\, p'(z)\, \frac{p(z)}{p'(z)}\, g(z) = \int dz\, p'(z)\, g'(z),$$

where $g'(z)$ is equal to $g(z)$ multiplied by the reweighting factor $p(z)/p'(z)$. That means, instead of sampling $z$ according to $p(z)$ and summing up $g(z_i)$ for each sample point $z_i$ we can, assuming $p(z)$ and $p'(z)$ are known, alternatively sample according to $p'(z)$ summing up $g'(z_i) = (p(z_i)/p'(z_i))\, g(z_i)$. For example, the method of importance sampling (Montvay & Münster, 1994) uses reweighting to reduce the plug-in error by choosing the reweighting factor so that the reweighted function $g'(z)$ is as constant as possible. But note that the factor $p'(z)$ must be a probability and can especially never be negative. In the situation we are discussing here, $p(z)$ corresponds to the relevant distribution of test questions $q \in Q^l$ with $p(q, y|f^0) = p(q)\, p(y|q, f^0)$ while $p'(z)$ is the distribution available to generate training data.
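The identity behind reweighting can be checked numerically. The sketch below compares the plain plug-in estimate of an expectation under a test density $p$ with the reweighted estimate computed from samples of a different sampling density $p'$. The two densities on $[0, 1]$, the function $g$, and all names are hypothetical choices made only for this illustration.

```python
import math
import random

def plug_in(g, sampler, n, rng):
    """Plain plug-in estimate of E_p[g]: sample z ~ p and average g(z)."""
    return sum(g(sampler(rng)) for _ in range(n)) / n

def reweighted_plug_in(g, p, p_prime, sampler_prime, n, rng):
    """Estimate E_p[g] from samples z ~ p' using g'(z) = p(z) / p'(z) * g(z)."""
    total = 0.0
    for _ in range(n):
        z = sampler_prime(rng)
        total += p(z) / p_prime(z) * g(z)
    return total / n

# Hypothetical densities on [0, 1]: test density p(z) = 2 z, sampling density p'(z) = 1.
p = lambda z: 2.0 * z
p_prime = lambda z: 1.0
sample_p = lambda rng: math.sqrt(rng.random())   # inverse-CDF sampling from p
sample_p_prime = lambda rng: rng.random()
g = lambda z: z ** 2                             # E_p[g] = 1/2 exactly

rng = random.Random(1)
direct = plug_in(g, sample_p, 20000, rng)
reweighted = reweighted_plug_in(g, p, p_prime, sample_p_prime, 20000, rng)
```

Both estimates approach the exact value 1/2. The reweighted one only requires the ratio $p(z)/p'(z)$, not a sampler for $p$ itself, and it requires $p'(z) \neq 0$ wherever $p(z)\, g(z) \neq 0$.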

The simplest case of reweighting is when the set of test and the sampling population of potential training questions are the same, i.e. $Q^S = Q^l$, but have different probability distributions $p^S(q) \neq p(q)$. Then, if the ratio $p(q)/p^S(q)$ is known (not necessarily the test and available training distributions $p(q)$, $p^S(q)$ themselves) one can use $p(q)/p^S(q)$ as reweighting factor for the loss function. Thus, the sampling data term of the risk can be related to a sampling process $S$ with $p^S(q)$ by

$$r(f, \hat f) = \int dq \int dy\, p^S(q)\, p(y|q, f)\, l^w(q, y, \hat f),$$

where the definition of the reweighted loss

$$l^w(q, y, \hat f) = \frac{p(q)}{p^S(q)}\, l(q, y, \hat f)$$

compensates for deviations between $p(q)$ and $p^S(q)$ and $p^S(q) \neq 0$ according to the definition of $Q^S$. If not stated otherwise we understand in this paper implicitly $l$ to mean $l^w$ or $p(q)/p^S(q) = 1$.

Figure 8: Shown is the distinction for data into the two subgroups of a. sampled or training data being sampled from a stationary process $S$ and b. prior data from any other sources. For questions there are 1. the set of relevant questions $Q^l$, 2. the sampling population $Q^S$ from which training questions are drawn according to a stationary $p^S(q)$, 3. the training questions (x), and 4. prior or non-sampled questions (o). The sets of training and prior questions are finite. The part of the loss integration depending on $Q^S$ can be determined by sampling training data. The remaining part of the expected risk is called costs and must be determined for example by a Bayesian risk calculation, another sampling process or using exhaustive queries if $Q^c = Q^l \setminus Q^S$ consists of a finite number of deterministic questions.

Only a deterministic function must be included for a $q \in Q^l$ depending deterministically on one $q \in Q^S$. But for a $q \in Q^l$ depending deterministically on two or more $q_i \in Q^S$ the probability of being sampled decreases with the product of the probabilities for (independent) $q_i$.

More general is a probabilistic dependence on answers to questions from $Q^S$. Consider relevant questions $q \in Q^l$ given in their dependence on the sample questions $q^S$

$$p(y|q, f^0) = \int dq^S \int dy^S \int dz^S\, p(q^S|y_c^S, z_c^S, q)\, p(y^S|q^S, f^0)\, p(z^S|q^S, y^S, q)\, \delta\big(q(q^S, y^S, z^S) - y\big),$$

with the two $f^0$-independent and $q$-dependent probability factors and the defining function $q(q^S, y^S, z^S)$ assumed to be known. The probability $p(y^S|q^S, f^0)$ depends on the state of nature and is unknown. Then eliminating the $\delta$-function by performing the $y$-integration shows that the empirical risk is a sum of

$$l\big(q_i, y_i = q(q_i^S, y_i^S, z_i^S), z_i\big) = l(q_i, q_i^S, y_i^S, z_i^S, z_i),$$

when a sampling procedure for the variables is available according to

$$p(q, z, q^S, y^S, z^S) = p(q|y_c, z_c)\, p(q^S|y_c^S, z_c^S, q)\, p(y^S|q^S, f^0)\, p(z^S|q^S, y^S, q)\, p\big(z|q, y(q^S, y^S, z^S), \hat f\big).$$

Using the corresponding devices the sampling steps within the single components are the following:

$$q \to q^S \to y^S \to z^S \to z$$

Assuming probability distributions $p^S(q^S|y_c^S, z_c^S, q)$ to be available to generate training data gives a $q$-, $y_c^S$-, $z_c^S$-dependent reweighting factor for the loss function

$$\frac{p(q^S|y_c^S, z_c^S, q)}{p^S(q^S|y_c^S, z_c^S, q)}.$$

As already discussed for the deterministic case it is a principal problem for nonlocal questions that the $q^S$ can form a vector. Take as example a question measuring answer differences between points $x_1$ and $x_2 = x_1 + \Delta$

$$p(y|q, f^0) = \int dx_1\, dx_2\, p(x_1|q)\, \delta(x_2 - \Delta - x_1)\, p(y_1|x_1, f^0)\, p(y_2|x_2, f^0)\, \delta(y_1 - y_2 - y),$$

and assume that the $x_i$ are i.i.d. sampled. Here the probability to have a complete pair of two values $y_1(x_1)$ and $y_2(x_2)$ to insert as $y = (y_1, y_2)$ into the loss function has measure zero in standard continuous cases (even for $\Delta = 0$) if $\Delta$ is fixed and not integrated over. If the distribution is not as badly behaved as the $\delta$-function in the example there is still the high dimensionality of $x$. 55 That means the amount of available data is usually too small to sample such nonlocal questions.



55 High dimensionality itself is not the problem if there are enough restrictions for the function. The situation is like in one-dimensional problems, as a theorem from Kolmogorov shows that every continuous function of several variables can be expressed as superposition of functions of one variable for closed and bounded input domain (Kolmogorov, 1957). But the nonlocal priors used in the theory of uniform convergence, corresponding for example to a specific VC dimension or $\epsilon$-entropy, and related to a restricted $F$, might not be considered as weak and harmless in those cases without empirical foundation.



Missing values 

One could think of remedying the situation by using guesses for the missing values necessary to answer a generalized question. Indeed, the Monte Carlo method or plug-in principle for expectations can be interpreted as replacing missing values by the data mean. This implicitly assumes the loss function to be constant (or compensating) for missing values. For example, the empirical risk estimate can be seen as minimization of a mean square error of the empirical loss with a prior (for the loss, not for $f^0$) $\lambda\, (l(q, y, \hat f) - E(l))^2$, with $E(l)$ the expectation of $l$, in the $\lambda \to \infty$ limit.

While the plug-in principle replaces missing values $y_x$ by the global sample mean $\frac{1}{N} \sum_x y_x$, one might wish for better, locally varying approximations of $p(y|x, f^0)$. This can be done by methods of density estimation, including parametric approximations and nonparametric methods like splines, kernel and nearest neighbor methods (see for example Silverman, 1986; Härdle, 1990), or neural networks.

Optimal in a Bayesian sense would be sampling according to the posterior probability $p(y|q, f(D))$, but without reference to a specific model of $f^0$ all methods are somewhat ad-hoc. 56 But all those approximation methods used to replace missing values can be interpreted as an approximation of the posterior $p(f^0|D)$ with respect to some (implicit) model of $p(y|q, f^0)$. Specifically, Girosi, Jones, & Poggio (1995) relate common interpolation methods (Radial Basis Functions, splines) to regularization terms, i.e. from a Bayesian point of view to priors within a maximum posterior approximation. Also note that all these approximation methods which one can use to replace missing values are themselves decision algorithms and have therefore, within a Frequentist approach, the same problems with general structural information as the Frequentist approach for our original decision problem. Thus, all the aspects we discuss for evaluating the empirical risk for generalized questions appear here again for a specific approximation problem.

8.4 Prior data 

Extended loss, indirect priors, and virtual 
examples 

Common is a situation where prior information is 
available in addition to training data sampled from the 
test (relevant) questions. This information can always be 
seen as corresponding to questions not included in the 
relevant set Q l . An example are priors like an approx- 
imate symmetry and smoothness, being normally not 
sampled and not included in the set of test questions. 57

56 Such methods can also be used directly for the loss function. This corresponds to a $\hat f$-dependent prior on the loss function, usually difficult to relate to priors on $f^0$. The optimal solution $p(l|f(D), \hat f)$ of this problem in the Bayesian framework is already nearly equivalent to the solution of the whole decision problem or even more complicated as not the whole loss distribution might be relevant for the risk functional $r$.

57 We could define an induction problem as a situation
where the generating distribution for test and training data 



In those cases the loss function can be extended by 
additional terms to be defined also for data not in the 
test set. This means, the plug-in principle uses an ex- 
tended expected risk r with respect to an extended set 

Q s 

dy dyp(q s \y Ci y c )p(y\q S , f)p(y\q S , f)Kf ' ,V, S), 

with an extended loss function / to approximate the 
(true) expected risk under the state of nature /°. For 
example a smoothness property can be included as ad- 
ditional term. The optimal weight of the additional 
questions, i.e. p(q s ), is usually determined by cross- 
validation or similar methods (see next Subsection). The 
extended loss should be chosen so that the relevant fea- 
tures according to the risk functional r of p(/|/, /) are 
similar to p(/|/, /) but this requires in general a model 
of /° to be determined. In usual approximation prob- 
lems one chooses an extended loss which enforces / to 
answer similar as /° also to the not relevant questions 
as / hoping that this results in similar answers to rele- 
vant questions, too. We will relate this ad-hoc method 
for approximation problems to the maximum posterior 
approximation within the Bayesian approach. We now 
show how extra terms can be interpreted as arising from 
a Lagrange implementation of indirect priors. 

An indirect way to include nonlocal information is try- 
ing to transform our knowledge / into knowledge about 
/. More precisely, if we call the probability of choos- 
ing /, i.e. p(f\f, a, q r ) an indirect prior, one is interested 
in excluding alternatives / with zero (or small 58 ) indi- 
rect prior probability. As the probability of selecting / 
depends in general not only on f(D) but also on the de- 
cision problem, i.e. the q r , including the risk functional, 
loss function, test data distribution, and the algorithm a, 
its complete determination is much more complicated
than the decision problem itself requiring the represen- 
tation of / in a model with certain /°. Ideally, we are 
interested in the zero part of the indirect prior resulting 
from the optimal algorithm defined by q r . 



coincide, a transduction problem as a situation where some 
data do not belong to the test set (Vapnik, 1982). In this 
formulation, most practical problems are not induction but 
transduction problems as the nonlocal (e.g. smoothness) in- 
formation is normally not part of the test set. 

58 Knowledge about the form of the nonzero part of $p(\hat f|f, a, q^r)$ is, in principle, of no help in searching for the absolute minimum, as the minimization is over all $\hat f$, even the unlikely ones, and only the impossible ones can be excluded. Exceptions are cases where (also subsequent) knowledge of risk values for some $\hat f$ can be used to exclude others, i.e. if nonlocal information about the risk functional can be used to exclude certain possibilities. For example in the case of a decision problem with known minimal value, or if one is not looking for the absolute minimum but only for an acceptable minimum, checking $\hat f$ with high probability first is of course a good idea. Also in other cases one may ignore $\hat f$ with small probability so that the chance of missing the optimum is small. This is a problem on the level of comparing approximations to a decision problem and its analysis requires a corresponding risk and loss to be defined.



For example, in approximation problems with a 
quadratic loss function $l(q, y, \hat f) = (y - y(\hat f))^2$ we know
that the optimal solution is the true regression function 
E(y(f° , x)) if contained in the search space F. Then we 
can simply implement deterministic information about 
the true regression function of /° by the correspond- 
ing restrictions on /, that is if we know p(y\q, f°) = 
S(q({E(y(f°,x)),x E X})) - y) we only use / with 

q(f) = y = y- 

Sometimes, assuming the existence of a state produc- 
ing process with stationary distribution corresponding 
to a possibly unknown but reproducible state of knowl- 
edge one may also empirically estimate indirect priors 
by counting the results of learning algorithms. 59 

While some restrictions for $\hat f$ like the range of possible output values or specific symmetries are easily implementable, others, e.g. smoothness, are best taken into account by using the method of Lagrange multipliers. (For the exact conditions under which this is possible see for example Bertsekas, 1995.) Formulating the restriction in the form $q_a(\hat f) = a$ this means constructing an extended risk by adding the following extra term to the risk

$$\lambda(a)\, (q_a(\hat f) - a).$$

Here $\lambda$ is the $a$-dependent Lagrange multiplier, the term $-\lambda a$ can be skipped because it is $\hat f$-independent, and $a$, $q_a$ and therefore $\lambda$ can be vectors. Note that for a given problem, including data, $a$ determines $\lambda$. For unknown $a$, determination of $a$ by cross-validation (see next Subsection) can be seen as similar to imposing a prior on $a$.

Now we shortly discuss how sometimes additional nonlocal terms in the risk can be approximated by using virtual examples. If $q_a = \int dx\, p(x|q_a)\, q_a(x, \hat f)$ is a sum or integral it might sometimes be easier in practice to use only part of the sum if the generalization ability is ensured in another way, for example by restriction of $F$. Consider a deterministic $\hat f(x)$ with a symmetry or smoothness log-prior

$$\lambda(a)\, q_a = \lambda(a) \int dx\, w_{q_a}(x)\, \big(\hat f(x) - \hat f(sx)\big)^2.$$

Then one can sample $x$-values according to $w_{q_a}(x)$ if those are positive and can be normalized. This sampling can be done independently of the sampling of the training examples according to $p^S(x)$ (for $x = q^D \in Q^D$). Thus, the term can be approximated by sampling $x$ and calculating $\hat f(x)$ and $\hat f(sx)$.
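A minimal sketch of this sampling approximation follows, assuming a mirror symmetry $s(x) = -x$ and a weight function that is uniform on $[-1, 1]$. The symmetry, the weight, the two candidate functions, and the names are illustrative assumptions, not choices made in the text.

```python
import random

def symmetry_penalty(f_hat, s, sample_x, n, rng):
    """Monte Carlo estimate of the integral of w(x) (f_hat(x) - f_hat(s(x)))^2,
    with x sampled according to the normalized weight w."""
    total = 0.0
    for _ in range(n):
        x = sample_x(rng)
        total += (f_hat(x) - f_hat(s(x))) ** 2
    return total / n

# Hypothetical setting: mirror symmetry s(x) = -x, weight w uniform on [-1, 1].
s = lambda x: -x
sample_x = lambda rng: rng.uniform(-1.0, 1.0)
even = lambda x: x * x   # respects the symmetry
odd = lambda x: x        # violates it

rng = random.Random(2)
penalty_even = symmetry_penalty(even, s, sample_x, 5000, rng)
penalty_odd = symmetry_penalty(odd, s, sample_x, 5000, rng)
```

The symmetric candidate receives no penalty, while for the odd candidate the estimate approaches the exact value $\int_{-1}^{1} \frac{1}{2} (2x)^2 dx = 4/3$. No training targets are needed for this term, only the sampled inputs $x$ and their images $sx$.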

59 Referring to practical experience or literature about the use of specific priors approximates this method.

For a quadratic loss $l(x, y, \hat f) = (y - \hat f(x))^2$ and some $b$ constant with respect to $x$ with $b(a)\, p^S(x) = \lambda(a)\, w_{q_a}(x)$ the sampling for the data and symmetry terms can be combined. As $\lambda$ depends on $a$ so also does $b$. With $l^0 = \lambda(a)\, q_a$ and splitting the data terms in the empirical risk

$$\hat r = (1 - b(a))\, r^D(f, \hat f) + b(a)\, r^D(f, \hat f) + l^0.$$

The integrations over $x$ and $y$ do not affect $l^0$ being independent of those variables, and we get for the last two parts

$$b(a) \int dx \int dy\, p^S(x)\, p(y|x, f)\, (l^D + l^c) = b(a) \int dx \int dy\, p^S(x)\, p(y|x, f) \left[ (y - \hat f(sx))^2 - 2\, (y - \hat f(x))\, (\hat f(x) - \hat f(sx)) \right],$$

with $l^c = (\hat f(x) - \hat f(sx))^2$ denoting the integrand of the symmetry term.
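The bracket in the last line follows from completing the square. As a short check, write $l^D = (y - \hat f(x))^2$ for the data term and $l^c = (\hat f(x) - \hat f(sx))^2$ for the symmetry term, where $\hat f$ is the symbol assumed here for the approximating function:

```latex
\begin{align*}
(y - \hat f(sx))^2
  &= \big[(y - \hat f(x)) + (\hat f(x) - \hat f(sx))\big]^2 \\
  &= (y - \hat f(x))^2 + 2\,(y - \hat f(x))(\hat f(x) - \hat f(sx)) + (\hat f(x) - \hat f(sx))^2 , \\
\Rightarrow\quad l^D + l^c
  &= (y - \hat f(sx))^2 - 2\,(y - \hat f(x))(\hat f(x) - \hat f(sx)) .
\end{align*}
```

The first term on the right is the quadratic loss evaluated on the virtual example $(sx, y)$. The second is the cross term whose vanishing is required for the virtual example approximation.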



When the second term in the last line vanishes, the whole integral can therefore be calculated approximately by using virtual (input) examples 60 $(sx, y(x))$ (Abu-Mostafa, 1990, 1993a, 1993b ('hints'); Pomerleau, 1991 (ALVINN); Sietsma & Dow, 1991 (training with noise in practice); Vetter, Poggio, & Bülthoff, 1992 (virtual views of an object); Girosi & Chan, 1995 (for RBF); and, more theoretical, Webb, 1994; Leen, 1995; and Bishop (1995ab), who gives the explicit form of regularization terms, for quadratic and for cross-entropy error functions, for infinitesimal translations, by expanding $\hat f(sx)$ in a Taylor series around $\hat f(x)$). When $\hat f(x) \neq \hat f(sx)$ then all non-optimal $\hat f(x)$ are not equal to the regression function. For those $\hat f(x)$ the second term does not vanish and has to be considered in the actual minimization procedure. But when $(\hat f(x) - \hat f(sx))$ is zero or the optimal regression function is in $F$ the second term vanishes at the minimum.

Stratification: (cross-)validation and structural risk minimization

A trivial toy example may clarify the basic idea: Assume two deterministic questions $x_1$ and $x_2$ and a set of possible answers $Y = \{0, 1\}$ corresponding to the four possible functions $f_i$ characterized by their answers $(f_i(x_1), f_i(x_2))$: $f_1$: (0,0), $f_2$: (1,1), $f_3$: (1,0), $f_4$: (0,1). Let us sample data $D$ until we have answers to both questions, for example $x_1 = 0$ and $x_2 = 1$. Clearly, $f_4$ would minimize the mean square error. Now we perform the same minimization in a hierarchical way. We form the two groups (strata) of smooth functions $S_1 = \{f_1, f_2\}$ and non-smooth functions $S_2 = \{f_3, f_4\}$ and use the first data point, for example $x_1 = 0$, to minimize within the two groups finding $f_1$ and $f_4$. To decide between the optima of the two groups we sample more data until we get $x_2 = 1$. Again, we choose $f_4$. Note that also forming non-disjoint, overlapping groups, e.g. $S_i \subset S_{i+1}$, would lead to the same result as long as they include all four functions. However, a stratified search is not always equivalent to a full search. The difference is that the new data are only used to decide between 'winners' of the strata, and 'losers' are not reconsidered again. They may however better fit the complete data.

60 In this special case the input $sx$ is newly generated, while the same target $y$ for $x$ is used again for $sx$. We also used the term virtual examples in the Bayesian approach for sampling from the posterior probability $p(y|q, f)$ where for risk minimization both $q$ (e.g. $x$) and $y$ are generated according to the posterior.
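The toy example can be traced step by step in a few lines. The sketch below encodes the four functions by their answer pairs and the data as answers 0 to $x_1$ and 1 to $x_2$. The dictionary layout and the function names are assumptions made for this illustration.

```python
# The four candidate functions of the toy example, as answers (f(x1), f(x2)).
FUNCTIONS = {"f1": (0, 0), "f2": (1, 1), "f3": (1, 0), "f4": (0, 1)}
STRATA = {"smooth": ["f1", "f2"], "non-smooth": ["f3", "f4"]}

def squared_error(name, data):
    """Summed squared error of a candidate on data given as {question index: answer}."""
    answers = FUNCTIONS[name]
    return sum((answers[i] - y) ** 2 for i, y in data.items())

def full_search(data):
    return min(FUNCTIONS, key=lambda name: squared_error(name, data))

def stratified_search(first_data, later_data):
    # Step 1: find the winner of each stratum using the first data only.
    winners = [min(members, key=lambda name: squared_error(name, first_data))
               for members in STRATA.values()]
    # Step 2: decide between the winners using the later data only.
    chosen = min(winners, key=lambda name: squared_error(name, later_data))
    return winners, chosen

first = {0: 0}   # answer 0 to question x1
later = {1: 1}   # answer 1 to question x2
winners, chosen = stratified_search(first, later)
best = full_search({**first, **later})
```

Here the stratified and the full search agree on $f_4$. The second step never revisits the losers of the first step, which is why the two procedures can differ in other settings.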

In the case of too large sets $F$ the difference between the minima of the expected (training) risk for the empirically chosen and the optimal $\hat f$ can become too large (Vapnik, 1982) and solutions can depend too strongly (non-continuously) on the data (Tikhonov, 1963). Then it is necessary to restrict the minimization to a simpler subtask (regularization).

Subsets or strata are defined by some deterministic question $q_a$ requiring $q_a(\hat f) = s$. Then we search for the minimum first within the strata and compare the best solutions of different strata in a second step. We want to select $q_a$ so that minimization within each stratum $s$ is possible 61. Selecting a specific stratum is equivalent to implementing an indirect prior, discussed in the last paragraph. Practically, it can be done by direct restriction of function values like using hardwired symmetries, or, more indirectly, by restricting parameter sets. Examples include choosing the number of nodes or the initial values of the weights of a neural network or the learning algorithm used. This can create overlapping strata, but this is not a problem in principle and only means that some solutions are considered more than once in the search process. (This is indeed the normal case for nonlocal $q_a$ like smoothness, where the value of $q_a$ can be increased without changing any function value at a data point.) All these constraints are easily kept constant during the minimization (learning) process.

Some constraints like smoothness might be difficult 
to implement directly. Then again, this minimization 
with constraints can be done by the method of Lagrange 
multipliers, producing an extra term to be added to the 
empirical risk 

$$\lambda(s)\,\bigl(q_s(f) - s\bigr).$$

The only difference to the preceding paragraph is that 
the optimal value of $\lambda$ is not yet determined. The comparison 
between strata to find $\lambda$ (and therefore $s$) has to 
be done with an independent data set. That means we 
have to separate the available training data into a training 
set to be used within the strata and a training set 
to decide between the strata, also called the validation set.$^{62}$ 
Repeating the procedure several times with the same 
total data set but with different splittings into training 
and validation set is called cross-validation (Stone, 1974, 
1977, Allen, 1974). A stratification method for strata 
where the VC dimension can be calculated is the method 
of structural risk minimization, where the validation step 
is replaced by minimizing the worst case empirical risk. 
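
The selection of the Lagrange multiplier by cross-validation can be sketched in a few lines. The following is only an illustration under assumptions that are not part of the text: a one-parameter linear model $f(x) = wx$, the penalty $q(f) = w^2$ (so that the added term is $\lambda w^2$, the constant $-\lambda s$ being irrelevant for the minimization at fixed $\lambda$), a small hypothetical grid of $\lambda$ values, and synthetic data. All function names are hypothetical.

```python
import random

def fit_ridge_slope(xs, ys, lam):
    """Minimize sum_i (y_i - w*x_i)^2 + lam * w^2 over the slope w (closed form)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def mean_squared_error(w, xs, ys):
    """Average squared error of the model f(x) = w*x on the given points."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, lambdas, n_folds=5):
    """Return (best_lambda, scores); scores[lam] is the mean validation error."""
    n = len(xs)
    scores = {}
    for lam in lambdas:
        total = 0.0
        for k in range(n_folds):
            # Every n_folds-th point (offset k) is held out as validation set.
            val_idx = set(range(k, n, n_folds))
            train_x = [xs[i] for i in range(n) if i not in val_idx]
            train_y = [ys[i] for i in range(n) if i not in val_idx]
            val_x = [xs[i] for i in sorted(val_idx)]
            val_y = [ys[i] for i in sorted(val_idx)]
            # Minimization within the stratum uses the training part only.
            w = fit_ridge_slope(train_x, train_y, lam)
            # The comparison between strata uses the held-out part only.
            total += mean_squared_error(w, val_x, val_y)
        scores[lam] = total / n_folds
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lambda, scores = cross_validate(xs, ys, lambdas)
print(best_lambda, scores[best_lambda])
```

Note that the validation error reported for the selected $\lambda$ is itself optimized over the grid, so an unbiased estimate of the expected risk would need further data not used in either step.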
61 Low VC dimension or $\epsilon$-entropy in the theory of uniform 
convergence, or compactness for a positive real $\delta$ and a continuous 
one-to-one extremal equation in regularization theory. See 
(Verri & Poggio, 1986) for examples for $q_s$. The existence of 
a stable, unique solution is the result of practical interest in 
regularization theory. For finite, noisy cases the convergence 
theorems are of less practical value. 

62 The estimation of the true expected risk after the cross-validation 
procedure would require a third set of available 
data, the empirical test set. 



Fig. 9 illustrates, for the example of structural risk 
minimization, that the selection of proper strata depends 
on indirect priors. (See also Wolpert, 1994b, Ripley, 1996.) 

In general the minimization within some specific 
strata (e.g. for those strata including functions where 
a given smoothness functional is not even defined) or 
also between strata can be very difficult or even impossible. 
To make those problems solvable an indirect prior 
for $f$ must be available to restrict the number of strata 
and also to enable minimization between strata. Practically, 
the maximal number of different $\lambda(s)$ which 
can be considered is normally at least restricted by the 
available computational resources. There is, therefore, 
more that choosing a good stratification variable $q_s$ can 
do. It can make the minimization problem between the 
strata easy (for one or more algorithms under consideration). 
This is an algorithm-specific aspect not directly 
related to prior information about $f^0$. It means that 
$q_s(f)$ already approximates relevant $f^0$-independent aspects 
of $r(f^0, f)$. And indeed most figures found in 
the literature plotting the empirical error against, for instance, 
some smoothness-related $s$ or $\lambda(s)$ show a very 
simple one-minimum structure. In principle, even for a 
restricted range of values for $s$ such a function could 
look arbitrarily wild, having for example a multi-minima 
or even random-like structure. 

If information is available which restricts the range 
of $s$, this is a conceptually clear case of an indirect prior. 
However, what one usually does is more like the following: 
begin with a starting value $s_0$ and explore each of its 
components in a given direction until the first common 
minimum is found, and stop there. That is, one assumes 
a certain form of $s$-dependence of the risk $r(f^0, f^*(s))$ for 
the $f^*$ optimal for $s$. 
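
The implicit assumption behind this 'stop at the first minimum' search can be made concrete with a small sketch. It is restricted to a one-dimensional $s$ and uses two hypothetical risk curves, one with a single minimum and one with a shallow dip in front of a deeper one; the function names and curves are assumptions for illustration only.

```python
def first_local_minimum(risk, s0, step, max_steps=100):
    """Walk from s0 in steps of `step`; stop as soon as the risk stops decreasing."""
    s, r = s0, risk(s0)
    for _ in range(max_steps):
        s_next = s + step
        r_next = risk(s_next)
        if r_next >= r:  # first point whose successor is not better
            return s, r
        s, r = s_next, r_next
    return s, r

def grid_minimum(risk, s_values):
    """Exhaustive comparison of all strata on a grid, for reference."""
    best = min(s_values, key=risk)
    return best, risk(best)

def one_minimum(s):
    # Hypothetical risk curve with the simple one-minimum structure.
    return (s - 3.0) ** 2

def two_minima(s):
    # Hypothetical risk curve: shallow dip near s = 1, deeper minimum near s = 4.
    return min((s - 1.0) ** 2 + 1.0, (s - 4.0) ** 2)

grid = [0.25 * k for k in range(25)]  # s in [0, 6]
print(first_local_minimum(one_minimum, 0.0, 0.25), grid_minimum(one_minimum, grid))
print(first_local_minimum(two_minima, 0.0, 0.25), grid_minimum(two_minima, grid))
```

For the one-minimum curve both searches agree, while for the second curve the stepwise search stops in the shallow dip. The stepwise procedure is therefore only justified if the assumed form of the $s$-dependence actually holds.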

8.5 Bayesian interpretation of the Frequentist 
approach 

8.5.1 Approximation and non-approximation loss 

We defined an integrated loss function $l(q,y,f) = \int dy'\, p(y'|q,f)\, l(q,y,y')$ 
for test questions with $p(q|y_c,z_c) = p(q|y_c)$ 
or simply $p(q)$, as we will choose for the Frequentist 
setting. This integrated loss function has the 
same arguments as a log-posterior, except that $f$ and $f^0$ 
are exchanged. 

To interpret the Frequentist approach from a Bayesian 
point of view we choose the same parameter space for $F$ 
and $F^0$. This corresponds to a one-to-one mapping, 
which we call parameter mapping, between $F$ and $F^0$ 
which we can use to identify 

$$f = f^0.$$

Specifically, $f^0 = f \Leftrightarrow \forall q: f^0(q) = f(q)$ when 
$f^0$ can be parameterized by a deterministic function 
$f^0(q)$. The optimality mapping for the risk functional, 
$f(f^0) = \mathrm{argmin}_f\, r(f^0, f)$, defines another mapping between 
$F^0$ and $F$. We already discussed that a one-to-one 
optimality mapping is obtained if $f^0$ which lead to 
the same optimal $f$ can be identified, and by excluding 
Figure 9: The VC horizon: Consider a set $F$ with a subset 
$S$ of infinite VC dimension. If the sequence of sets 
$S_1 \subset S_2 \subset S_3 \subset S_4 \subset \cdots \subset S_n \subset S$ does not contain the 
optimal $f^*$ (marked by x in the figure) then the VC dimension 
of $S_n$ becomes infinite for $n \to \infty$ before the 
optimal $f^*$ is found. On the other hand, choosing for 
example $S_1 = \{f^*\}$ to be (or to include) $f^*$, the optimal 
solution is already found in the first step. The final uniform 
bound is lower if $f^*$ is found earlier in a smaller 
$S_i$. The VC bound of Structural Risk Minimization therefore 
depends on the choice of the sequence $S_i$. The 
probabilistic aspects of this choice, however, do not enter 
the VC bound. Strictly deterministic, i.e. uniform, 
prior information only defines the set $S$ and not a chain 
of sets $S_i$. Here prior information $p(f)$ has the form of 
an indirect prior, i.e. it is about optimal actions and not 
states of nature. (Only in saddle point approximation 
for approximation problems, i.e. $l \propto -\ln p + c$ where we 
can identify $f^0$ with $f$, is this equivalent to a direct prior 
$p(f^0)$ for state $f^0$.) If $F$ is large, a prior-free construction 
of the chain $S_i$ by uniform sampling will with high probability 
not contain good candidates. To yield reasonable 
results the sequence of $S_i$ has to be chosen depending on 
the probability that $S_i$ contains the optimal solution, i.e. 
depending on probabilistic prior information. Indeed, if 
we can attribute to most $f$ very low probabilities to be 
the optimal solution, this increases the chance of testing 
good candidates. 
$f$ which are optimal for no $f^0$. For such a construction 
the $f^0$ related to the $f$ by the optimality mapping can be 
seen as effective states of nature for the decision problem 
involving $f$. 

We called a loss function approximation loss if for all 
$f = f^0$ (see Section 5.3.4) 

$$l(q,y,f) = -c_1 L^D(y|q,f^0) - c_0$$

with $f$-independent constants $c_0$, $c_1$ adjusting the value 
of the minimum of $l$ or the normalization constant of 
$L^D$. We remark that an $f$-independent $c_0$ corresponds 
to a fixed normalisation of $l^A$ over $y$ for all $f$ and all $q$. 
Equivalently we may say in this case 

$$l(q,y,f) = l^A(q,y,f),$$

if we define the approximation part of the loss as 

$$l^A(q,y,f) = -c_1 L^D(y|q,f^0) - c_0.$$

In general this gives a ($c_1$, $c_0$-dependent) decomposition 
of the loss function into an approximation and non-approximation 
part of the loss 

$$l(q,y,f) = l^A(q,y,f) + l^{NA}(q,y,f).$$

Analogously, we define the (loss function dependent) 
set $Q^A$ of approximation questions to consist of all questions 
$q$ for which we can choose for all $f = f^0$ and all $y$ 

$$l(q,y,f) = -c_1 L^D(y|q,f^0) - c_0. \qquad (20)$$

To achieve a parallel notation for loss and log-probabilities 
we will use in this case the convention 
$l^A(q,y,f) = l^D(q,y,f)$ for $q \in Q^S$, i.e. if possible we 
choose the sampled loss equal to the approximation part. 

For those questions a large log-posterior for $f^0$ is 
equivalent to a small loss for $f(f^0)$, which characterizes 
the situation for $q$ in a parallel decision problem as an 
approximation problem. In an inverse setting this is not 
necessarily true, as for example a loss $l^D = (q - f(y))^2$ 
measuring the reconstruction quality of $q = f(y)$ would 
correspond to a log-posterior $L^D \propto -(q - f^0(y))^2$ instead 
of $L^D \propto -(y - f^0(q))^2$. 

The decomposition of the loss function into an approximation 
and non-approximation part of the loss induces 
the same decomposition for the risk 

$$r(f^0,f) = r^A(f^0,f) + r^{NA}(f^0,f).$$

Especially common is the case with prior costs and $Q^S \subset Q^A$, 
for which we have (using our convention for $l^D$ in 
such cases) 

$$l(q,y,f) = l^A(q,y,f) + l^{NA}(f) = l^A(q,y,f) + l^0(f) = l^D(q,y,f) + l^0(f).$$

Prior costs have the form of a fixed additional term 
implementing restrictions within $F$ according to the 
method of Lagrange. Table 3 summarizes decompositions 
of loss, risk, and log-posterior: 



Decompositions 

Sampled vs. non-sampled 
  log-posterior:  $L(D, D^0|f^0) = L^D(D|f^0) + L^0(D^0|f^0) = \sum_i L^D(y_i|q_i,f^0) + L^0(D^0|f^0)$ 
  risk:           $r(f^0,f) = r^D(f^0,f) + r^0(f^0,f)$ 

Approximation vs. non-approximation 
  risk:           $r(f^0,f) = r^A(f^0,f) + r^{NA}(f^0,f)$ 
  loss:           $l(q,y,f) = l^A(q,y,f) + l^{NA}(q,y,f)$ 

Special case: $Q^S \subset Q^A$ with prior costs 
  loss:           $l(q,y,f) = l^A(q,y,f) + l^0(f) = l^D(q,y,f) + l^0(f)$ 

Table 3: Decompositions of log-posterior, loss function, 
and risk 



We now analyze the MaP-MiR procedure. To find 
the optimal $f^*$ in a MaP-MiR approximation the first 
MaP step is followed by a second MiR step to find $f^* = 
\mathrm{argmin}_f\, r(f^0, f)$ for the most probable state $f^0 = f^{0,*}$. 

Despite the one-to-one mapping between $F^0$ and $F$ and 
the same functional dependence of $L^D$ and $-l^D$ for 
$q \in Q^A$, there is no perfect symmetry between the two 
problems, as the maximization is done with respect only 
to a finite sum over the set of data $D$ and the minimization 
with respect to the full $y$- and $q$-integrals 
conditioned on the pure state $f^{0,*}$ with maximal posterior 
probability. Only if the mapping between $f^{0,*}$ and 
$f(f^{0,*})$ already corresponds to the optimality mapping 

$$f(f^{0,*}) = f^* = \mathrm{argmin}_f\, r(f^{0,*},f) = \mathrm{argmin}_f \int dq \int dy\, p(q)\, p(y|q,f^{0,*})\, l(q,y,f),$$

with 

$$p(y|q,f^0) = e^{L^D(y|q,f^0)},$$

then finding the state with maximal posterior probability 
already corresponds to minimizing the loss for it. 

We already saw in Section 5.3.4 that for approximation 
problems the relative risk 

$$c_1 \bigl(r(f^0,f^0) - r(f^0,f)\bigr) = K\bigl(p(y|q,f^0),\, p(y|q,f)\bigr)$$

is a Kullback-Leibler entropy and, using Jensen's inequality, 
that the optimal solution for the minimization step 
is 

$$l^A(q,y,f^*) = -c_1 L^D(y|q,f^{0,*}) - c_0.$$
We derive this well known result here again using an 
explicit calculation. For continuous parameter spaces 
$F$, a necessary optimality condition for minima not on 
the boundary is a vanishing gradient with respect to the 
parameter vector $f$ 

$$\left.\frac{\partial r(f^{0,*},f)}{\partial f}\right|_{f^*} = 0.$$

We study the stationarity condition$^{63}$ for that part of the 
63 In case the parametrization of $f$ (or $f^0$, respectively) does 
not ensure normalization, one has to add the normalization 
condition $\lambda(x)\,(1 - \int dy\, p(y|x,f))$ for all $x$, where $\lambda(x)$ is an 
$x$-dependent Lagrange multiplier. 



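loss integration which depends on approximation questions, 
i.e. with $q \in Q^A$, 

$$\left.\frac{\partial}{\partial f} \int_{Q^A \cap Q^l} dq \int dy\, p(q)\, e^{L^D(y|q,f^{0,*})}\, l^D(q,y,f)\right|_{f^*} = 0. \qquad (21)$$

Interchanging integration and differentiation and using 
$-c_1 L^D(y|q,f^0) - c_0 = l^D(q,y,f)$ at $f^{0,*} = f^*$ gives 

$$\begin{aligned}
\int dq\, p(q) \int dy\, e^{L^D(y|q,f^{0,*})} \left.\frac{\partial l^D(q,y,f)}{\partial f}\right|_{f^*}
&= -c_1 \int dq\, p(q) \int dy\, e^{L^D(y|q,f^{0,*})} \left.\frac{\partial L^D(y|q,f^0)}{\partial f^0}\right|_{f^{0,*}} \\
&= -c_1 \int dq\, p(q) \int dy \left.\frac{\partial}{\partial f^0} e^{L^D(y|q,f^0)}\right|_{f^{0,*}} \\
&= -c_1 \left.\frac{\partial}{\partial f^0} \int dq\, p(q) \int dy\, p(y|q,f^0)\right|_{f^{0,*}} \\
&= -c_1 \frac{\partial}{\partial f^0}\, 1 = 0.
\end{aligned}$$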



That means that for $Q^l = Q^A$ the stationarity conditions 
of the MiR step are automatically fulfilled if 
$F = F^0$.$^{64}$ (Here we discussed the saddle point approximation; 
however, the same result holds in a full Bayesian 
approach with respect to $f$ (not $f^0$) if $\ln p(y|q,f) = 
-c_1 l(q,y,f) + c_0$ and $F = F^0$.) For $Q^l = Q^A$ the 
MaP-MiR solution is not altered if $p(q)$ is changed 
(for fixed data). In particular, terms like $r^A(f^0,f) 
= \int dq\, p(q) \int dy\, p(y|q,f^0)\, l^D(q,y,f)$, with $l^A(q,y,f) = 
l^D(q,y,f) = -c_1 L^D(y|q,f^0) - c_0$ for every $f = f^0$, can be 
added or dropped from the risk (with $p(q)$ readjusted).$^{65}$ 
As long as that term is $f^0$-dependent it cannot be seen as 
part of the ($f^0$-independent) loss, i.e. it cannot be written 
as $l(f)$. In this situation approximating one point 
is equivalent to approximating a whole function. This 
could also be done for the prior part $L^0$, writing it in 
its data dependence, giving rise to cost terms $l^0(f^0,f)$. 
The non-approximation part of the loss depending on 
questions $q \in Q^l \setminus Q^A$ can cause a deviation between 
optimality mapping and parameter mapping. Examples 
of typical non-approximation loss include 

1. time and storage requirements for calculating $f$, 

2. costs of producing a hardware (VLSI) implementation 
of $f$, 



64 For example, when minimizing an $L_1$ error (sum of unsquared 
distances) for Gaussian probabilities the MiR condition is 
not automatically fulfilled. An example is the support vector 
machine (Vapnik, 1995) for classification, which uses an $L_1$ 
type of error in the non-separable case (which might e.g. 
result from noise). There the function to minimize consists 
of two terms, the norm of the weights of the optimal canonical 
hyperplane $\|w\|$ and the $L_1$ error, whose relative importance 
$C$ has to be determined for example by cross-validation or 
prior knowledge. (Individual $C_i$ for each data point $i$ could 
account for locally differing variances.) 

65 One could paraphrase this observation by: "One always 
can require what is already there." 



3. understandability of the structure of $f$. 

We remark that the normalization condition 
$\int dy\, p(y|q,f^0) = 1$ does not allow to choose, for a 
general $l(q,y,f)$ (energy function), $L = c_1 l + c_0$ with $f^0$-independent $c_0$. 
As discussed earlier, non-approximation loss related to 
the algorithm instead of $f$ defines a higher level decision 
problem. 

The typical example for approximation questions 
combines Gaussian noise with mean square error. Also, 
a question is an approximation question if a uniform $L$ 
is combined with a uniform $l$ finite on the same domain. 
Here log-probabilities $L$ (and analogously costs 
$l^0$) are called uniform on $F^0$ (or $F$) if they are equal 
to a constant, i.e. independent of $f^0$ (or $f$), on the domain 
where they are finite. An example is given by priors 
implemented by regularization terms to be determined by 
cross-validation with a restricted interval for the regularization 
constant. Uniform priors can be skipped from 
the formalism by restricting the parameter space $F^0$ to 
the domain on which the log-priors are finite. The corresponding 
$f(f^0)$ related to $f^0$ with zero prior probability 
by the optimality mapping can also be skipped. This is 
equivalent to the introduction of a uniform cost term. If 
there is no non-approximation part of the loss, the optimality 
mapping is equal to the parameter mapping and 
trivially implemented by using the same restrictions for 
$F^0$ and $F$. 

While we saw that risk (not necessarily loss function) 
terms corresponding to log-likelihoods can be added to 
the risk integral without changing the problem, (non-approximation) 
loss terms on the other hand cannot simply 
be implemented in the log-posterior.$^{66}$ For example, 
uniform prior costs, equivalent to a restriction of $F$, do 
not lead directly to a restriction of $F^0$. In principle one 
could regroup the $F^0$ by forming equivalence classes with 
respect to the restricted $F$, but this may destroy the relation 
(20). 

In general, for the MaP step being sufficient 

$$\left.\frac{\partial r^{NA}(f^{0,*},f)}{\partial f}\right|_{f=f^{0,*}} = 0$$

must hold. For prior costs $l^0(f)$ being $f^0$-independent 
this reads 

$$\left.\frac{\partial l^0(f)}{\partial f}\right|_{f=f^{0,*}} = 0$$

for every $f^0$ if $l^D$ is approximation loss. This allows 
special cases where $l^0(f)$ has a minimum $f = f^*$ at the 
maximum $f = f^{0,*}$ of the posterior $L^D + L^0$, i.e. $f^{0,*} = 
f^*$, and $l^0(f)$ coincides with an approximation risk term 
$r^A(f^{0,*}, f)$. Then, even nonuniform costs do not change 
the optimal $f$. However, this can only happen for specific 
$f^0$, as $l^0(f)$ is $f^0$-independent and we assume the MaP 
estimate to be data dependent. Otherwise calculating 
the MaP approximation would not be interesting. Thus, 
requiring that, depending on the possible data, every $f^0$ 
66 In contrast to the previous footnote this could be paraphrased 
by: "Reality is not always like one wants it to be." 



can be the most probable one, i.e. be a MaP estimate 
$f^{0,*}$, the derivative of the cost term must be zero at every 
$f = f^{0,*}$ 

$$\frac{\partial l^0(f)}{\partial f} = 0.$$

Then, $l^0$ is uniform. Because nonuniform, data- or $f^0$-independent 
costs belong to the non-approximation part 
of the loss, their presence requires the full two step MaP-MiR 
procedure.$^{67}$ An example would be using complexity 
costs to enforce simplicity of a model $f$ (e.g. smoothness, 
sparseness, integer values) independent of the data, 
maybe even when knowing that this is not true for nature 
$f^0$ (which might for example allow real values). 

8.5.2 MaP-MiR and ERM: Priors and prior costs 

Often, complexity-related prior costs are included in 
empirical risk minimization, either by explicit penalty 
terms or by choosing a specific structure for hypotheses 
$f$. If those complexity aspects are requirements not related 
to priors, ERM cannot be interpreted as a MaP-MiR 
procedure. For example, a tree classifier might be chosen 
because it can be obtained rather effectively and/or 
because the resulting rules are relatively easy to interpret. 
If there is no prior knowledge about $F^0$ having 
tree structure, and at the same time there is a more 
appropriate parameterization of $F^0$ available, then this 
one should be used in the MaP step while a tree classifier 
could be fitted in the MiR step. 

We use the results from the previous paragraph to discuss 
in more detail the relations between the Bayesian 
and Frequentist point of view of empirical risk minimization 
in the presence of priors and prior costs (in the 
following shortly called costs). Consider the three problems: 

1. Maximization of the posterior probability (MaP) 
given data $D$ 

$$\mathrm{argmax}_{f^0} \Bigl( \sum_{i=1}^{n} L^D(y_i|q_i,f^0) + L^0(f^0) \Bigr),$$

2. empirical risk minimization (ERM) given data $D$ 

$$\mathrm{argmin}_{f} \Bigl( \frac{1}{n} \sum_{i=1}^{n} l^D(q_i,y_i,f) + l^0(f) \Bigr),$$

being a sample estimate of 

3. a full Bayesian risk minimization (MiR) for fixed $f^0$ 

$$\mathrm{argmin}_{f} \Bigl( \int dq \int dy\, p(q)\, p(y|q,f^0)\, l^D(q,y,f) + l^0(f) \Bigr).$$

Within the MaP-MiR procedure the fixed $f^0$ for the MiR 
step is the result $f^{0,*}$ of the MaP step while for ERM 



67 This means that as soon as one knows that model $f$ and nature $f^0$ 
are not the same, one should use two different descriptions for 
both. Otherwise the knowledge about their difference cannot 
be used. 
$f^0$ is thought to be the true state of nature. We assume 
the sampling set to consist of approximation questions, 
i.e. $Q^S \subset Q^A$ or equivalently $-c_1 L^D(y|q,f^0) - c_0 = 
l^D(q,y,f)$ for $q \in Q^S$, choosing in the following $c_1 = 1$ 
for simplicity. Consider the following cases in which arbitrary 
constants $c'$, $c_0'$ exist, so that 

A: ('uniform costs and uniform priors') 

$$l^0(f) + c_0' = -L^0(f^0) + c' = 0,$$

B: ('uniform costs but nonuniform priors') 

$$l^0(f) = c', \quad L^0(f^0) \neq c_0',$$

C: ('nonuniform costs') 

$$l^0(f) \neq c',$$

D: ('costs $\propto -$ priors') 

$$n\, l^0(f) + L^0(f^0) = c'.$$

In case A we could choose $l^0 = 0$ by including the 
constant into $l^A = l^D$. It is the case with all priors and 
related prior costs already implemented as restrictions of 
$F^0$ and $F$. Here the numerical realizations of ERM and the 
MaP step are identical and their interpretations from the 
Frequentist and Bayesian point of view are fully compatible 
because a MiR step is not needed. $F^0$ and $F$ can 
be fully identified and there is no need to use a distinct 
notation for $f^0$ and $f$. 

In case B we can also choose $l^0 = 0$, so no MiR step 
is necessary, but the MaP step uses log-prior terms if 
available. As those terms are not part of the loss function, 
ERM should in principle ignore them. Namely, the 
nonuniform parts of priors do not enter the procedure 
of ERM or the (worst case) bounds for uniform convergence. 
Using the priors nevertheless as cost terms in 
ERM leads to complete numerical equivalence with the 
MaP step and therefore the whole MaP-MiR procedure. 
In this sense priors for Bayesians are related to costs for 
Frequentists. Note, however, that while here the numerical 
calculations coincide, they are interpreted as different 
models of nature, as costs cannot be identified with 
priors. 

Case C requires the MiR step. Therefore, in this case 
the two step MaP-MiR and the one step ERM differ. Exceptions 
are MaP results $f^{0,*}$ for which the nonuniform 
costs have a minimum. (For uniform costs all $f \in F$ 
are minima.) MaP-MiR incorporates priors and takes 
into account the differences between $F^0$ and $F$. This 
might be especially important if they differ strongly because 
priors and costs are related to different aspects. 
MaP-MiR is expected to improve ERM in situations 
where the prior had a strong influence compared to the 
sampled data in the MaP step and/or the cost term is 
substantial compared to the data term and related to 
aspects different from those of the prior. The same remarks 
apply when the sampled loss $l^D(q,y,f)$ itself is 
non-approximation loss (this means that, according to our 
convention, it cannot be chosen as approximation loss), i.e. 
$Q^S \not\subset Q^A$. 



Case D seems to show a perfect symmetry between 
MaP and ERM. Indeed, under these conditions ERM 
and MaP are numerically identical for the specific $n$ 

$$\mathrm{argmax}_{f^0} \Bigl( \sum_{i=1}^{n} L^D(y_i|q_i,f^0) + L^0(f^0) \Bigr) = \mathrm{argmin}_{f} \Bigl( \frac{1}{n} \sum_{i=1}^{n} l^D(q_i,y_i,f) + l^0(f) \Bigr).$$

But in as far as nonuniform costs are present the MaP 
step is not sufficient from a Bayesian viewpoint and the 
MiR step is missing. Indeed, for case D the MiR step 
takes into account the same function $L^0(f^0)$ again but in 
the form of costs $l^0(f)$, so the related aspects have a stronger 
influence in MaP-MiR than in ERM. This means that, 
if not already in the minimum, a function $f^* \neq f^{0,*}$ 
can be chosen which has a lower $l^0(f^*)$ than $l^0(f = f^{0,*})$ 
of the result $f^{0,*}$ of the ERM or MaP step. One may 
expect this effect usually to be small in practice, as the 
saddle point approximation assumes a strongly peaked 
maximum for $L^D + L^0$, usually arising in the limit $n \to \infty$ 
where case C becomes case B or, for uniform priors, case 
A. 
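
The behaviour described for case D can be checked on a minimal example. The sketch below assumes a single question with a scalar state $f^0$, Gaussian data with known $\sigma$, a quadratic log-prior $L^0(f^0) = -a (f^0)^2$ and the matching cost $l^0(f) = (a/n) f^2$, so that $n\, l^0 + L^0$ is constant. The constant $a$, the synthetic data and all function names are assumptions made for illustration only.

```python
import random

def map_estimate(ys, sigma, a):
    """MaP: argmax over f0 of sum_i -(y_i - f0)^2 / (2 sigma^2) - a * f0^2."""
    return sum(ys) / (len(ys) + 2.0 * a * sigma ** 2)

def erm_estimate(ys, sigma, a):
    """ERM: argmin over f of (1/n) sum_i (y_i - f)^2 / (2 sigma^2) + (a/n) * f^2."""
    n = len(ys)
    cost = a / n  # case D: n * l0(f) + L0(f0) = const
    return (sum(ys) / n) / (1.0 + 2.0 * cost * sigma ** 2)

def mir_estimate(f0_star, sigma, cost):
    """MiR: argmin over f of E[(y - f)^2] / (2 sigma^2) + cost * f^2,
    with y distributed as N(f0_star, sigma^2)."""
    return f0_star / (1.0 + 2.0 * cost * sigma ** 2)

rng = random.Random(1)
sigma, a = 1.0, 2.0
for n in (5, 50, 500):
    ys = [rng.gauss(1.0, sigma) for _ in range(n)]
    f0_star = map_estimate(ys, sigma, a)          # MaP step
    f_erm = erm_estimate(ys, sigma, a)            # one step ERM
    f_star = mir_estimate(f0_star, sigma, a / n)  # MiR step after MaP
    print(n, f0_star, f_erm, f_star)
```

In this toy setting the MaP and ERM results coincide, while the MiR step applies the quadratic term a second time and shrinks the result by the factor $1/(1 + 2a\sigma^2/n)$, which approaches one for growing $n$.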

The theory of uniform convergence does not require 
all the conditions necessary for a Bayesian interpretation 
of ERM as long as the training data are sampled according 
to the (arbitrary) relevant distribution. It applies for 
general $l \neq -c_1 L^D - c_0$ and also if $F$ and $F^0$ are chosen 
different, like in many examples of computational 
learning theory. Costs restricting the search space $F$, 
and changing for example its VC dimension, influence 
the bounds of the theory of uniform convergence. Its 
bounds do not depend on the form of the finite parts of 
log-priors $L^0$ as they are worst case considerations. Prior 
costs $l^0$ do not contribute to the difference between empirical 
and expected risk (if they are not sampled themselves), 
as they are data independent and included in both. The 
bounds only depend on the infinite part of prior costs $l^0$ 
restricting the space $F$, which can for example reduce the 
VC dimension. But in most of these cases ERM differs 
in method and results from a MaP-MiR approach. 

We can summarize the results by saying that for 
$F^0 = F$ uniform costs allow an interpretation of the 
numerical ERM procedure as a two step MaP-MiR procedure 
for approximation loss so that the MiR step is automatically 
satisfied. On the other hand, training data 
being sampled according to the test data distribution allow 
application of the bounds of the theory of uniform 
convergence to the results of empirical risk minimization. 

For example, take a prior $p(f^0)$ depending on a symmetry 
property like $S(x,f^0) = (f^0(x) - f^0(-x))^2$. If we 
choose a uniform prior which is zero for $f^0$ if $S(x,f^0)$ is 
above a bound $B$ and constant for those $f^0$ below that 
bound, we can implement the prior as a restriction on $F^0$ 
by excluding $f^0$ with $S(x,f^0) > B$. Through the optimality 
mapping this corresponds to the same restriction 
for the $f \in F$. This is the usual case. 
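
Implementing such a uniform prior as a restriction of the search space amounts to a simple filter. The sketch below assumes a finite set of candidate functions and a finite set of probe points, and takes the largest value of $S(x,f)$ over the probe points as the quantity compared with the bound $B$. These choices, the candidates and the function names are assumptions for illustration and not part of the formalism above.

```python
def symmetry_violation(f, xs):
    """Largest value of S(x, f) = (f(x) - f(-x))^2 over the probe points xs."""
    return max((f(x) - f(-x)) ** 2 for x in xs)

def restrict_by_symmetry(candidates, xs, bound):
    """Keep the candidates whose symmetry violation does not exceed the bound B."""
    return {name: f for name, f in candidates.items()
            if symmetry_violation(f, xs) <= bound}

# Hypothetical candidate set: an even function, a slightly perturbed one, an odd one.
candidates = {
    "even": lambda x: x * x,
    "nearly_even": lambda x: x * x + 0.05 * x,
    "odd": lambda x: x ** 3,
}
xs = [0.1 * k for k in range(1, 11)]  # probe points in (0, 1]
allowed = restrict_by_symmetry(candidates, xs, bound=0.05)
print(sorted(allowed))
```

Any subsequent minimization then runs over the allowed candidates only, so the restriction is kept constant during learning without any extra penalty term.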

Let us briefly write down the common example of a 
(one-dimensional) Gaussian probability 

$$p(y|q,f^0) = \frac{1}{\sqrt{2\pi}\,\sigma_q}\; e^{-\frac{(y - f^0(q))^2}{2\sigma_q^2}},$$

namely a log-posterior $L(y, f^0(q)) = -\ln(\sigma_q \sqrt{2\pi}) - 
(1/2)\sigma_q^{-2}(y - f^0(q))^2$ corresponds to a quadratic loss 
function $l(y,f) = (1/2)\sigma_q^{-2}(y - f(q))^2$ with $f(q)$-independent 
$\sigma(f(q)) = \sigma_q$, skipping the constant. Note 
that $f^0(q)$ is no random variable but parameterizes the 
states $f^0$ and corresponds to the regression function. As 
the regression function minimizes the mean square error 
we have $f(f^0) = \mathrm{argmin}_f\, r(f^0, f)$, or written explicitly 
the optimality condition (21) gives 

$$\int dy\, p(y|q,f^0)\, y = f(q),$$

because $\partial \sigma_q / \partial f(q) = 0$ if all $f$ have the same $\sigma(f(q)) = 
\sigma_q$. Inserting the form of $p(y|q,f^0)$ and performing the 
Gaussian integration we find again that the optimality 
condition is fulfilled. As $f(q)$ represents the regression 
function, all deterministic information about the regression 
function can be incorporated as indirect prior, that 
is by restricting the search space $F$. This holds in particular 
for a deterministic bound on the smoothness of the 
regression function. On the other hand, for nonuniform 
costs, like for example higher costs for states with regression 
far from zero, $l^0(f^0) \propto \sum_q f^0(q)^2$, the two step 
MaP-MiR approximation is not equivalent to an empirical 
risk minimization even if we choose a prior $L^0 \propto -l^0$. 
Fig. 10 visualizes some of the relations. 
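
As a worked illustration of the last point, consider a single question $q$ with $c_1 = 1$ and a cost of the assumed form $l^0(f) = c\, f(q)^2$ with a constant $c > 0$ (the constant $c$ and the restriction to one question are assumptions made here only for illustration). With $y$ Gaussian around the MaP result $f^{0,*}(q)$, the MiR step minimizes 

$$r(f^{0,*},f) = \int dy\, p(y|q,f^{0,*})\, \frac{(y - f(q))^2}{2\sigma_q^2} + c\, f(q)^2 = \frac{(f(q) - f^{0,*}(q))^2 + \sigma_q^2}{2\sigma_q^2} + c\, f(q)^2 .$$

Setting the derivative with respect to $f(q)$ to zero gives 

$$\frac{f(q) - f^{0,*}(q)}{\sigma_q^2} + 2c\, f(q) = 0 \quad\Rightarrow\quad f^*(q) = \frac{f^{0,*}(q)}{1 + 2c\,\sigma_q^2},$$

so under these assumptions the MiR step shrinks the MaP result towards zero, and $f^* = f^{0,*}$ is recovered only for $c = 0$ or $f^{0,*}(q) = 0$. 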

We summarize how the classical Frequentist approach 
of empirical risk minimization with additional (e.g. regu- 
larization or penalty) terms can be interpreted as a spe- 
cific Bayesian model. This 'classical' Bayesian model has 
the following specifications: 

1. Definition of an effective loss function $l$ for $z$-independent 
generation of test questions. 

2. Identification of the (parameter) space of actions $f$ 
with that of states $f^0$. 

3. The same function, up to a factor and a constant, 
is chosen for the $y$-dependent parts of the effective 
loss $l^D(q,y,f)$ and log-likelihood $L^D(y|q,f^0)$ 
depending on the same variables after identifying 
$f$ with $f^0$. (We use a formulation where sampled 
data correspond to approximation questions, i.e. 
$Q^S \subset Q^A$.) 

4. There are no nonuniform costs $l^0(f)$. 

5. The decision relevant risk functional is the expec- 
tation functional (Bayesian expected risk). 

Under these conditions empirical risk minimization cor- 
responds to an exact risk minimization (i.e. no plug-in 
estimate) for the state with maximal posterior probabil- 
ity, regardless of how the training questions have been 
sampled. 

Remarks on point 3: 



i. It characterizes the situation as an approximation 
problem in a parallel decision setting. 

ii. Note that AND and OR are exchanged in the following 
sense: When $y_1$ AND $y_2$ has been observed 
as training data then we assume $y_1$ OR $y_2$ can appear 
in a test situation and both log-posterior and 
loss consist of a sum. If we only know $y_1$ OR $y_2$ 
could have been the training data this would require 
adding probabilities and not log-probabilities, 
resulting in a non-additive structure for $L$. A non-additive 
loss function depending on more than one 
outcome at a time would have the different interpretation 
of an interaction of losses for repeated 
tests, i.e. for cases where $y_1$ AND $y_2$ happens. 

In physics mean field approximations and classical ap- 
proximations of field theories are related to saddle point 
approximations. The relation between the Frequentist 
approach as a maximum posterior approximation and 
the full Bayesian approach is for example similar to the 
relation between classical physics and the path integral 
formulation of quantum mechanics, with a field being the 
analogue of a pure state $f^0$. 

8.5.3 MiR perturbation theory 

The MiR step requires minimization of the expectation 
$\langle l \rangle = \langle l^A + l^{NA} \rangle$ under the distribution given 
by the MaP $f^{0,*}$. For a small enough 'perturbation' $l^{NA}$ 
we may expand $l(q,y,f)$ around $l(q,y,f = f^{0,*})$. To 
change the location of a minimum we have to go at least 
to second order 

$$l(q,y,f) \approx l(q,y,f^{0,*}) + (f - f^{0,*}) \left.\frac{\partial}{\partial f} l(q,y,f)\right|_{f=f^{0,*}} + \frac{1}{2} (f - f^{0,*})^2 \left.\frac{\partial^2}{\partial f^2} l(q,y,f)\right|_{f=f^{0,*}}.$$



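Normalization is assumed to be ensured by the parameterization 
of $f^0$, otherwise for example a Lagrange 
multiplier can be added. The stationarity condition 
$\frac{\partial}{\partial f} \langle l(q,y,f) \rangle = 0$ is linear in second order. It gives 
for the parameter vector $f$ the solution, assuming $c_1 = 1$, 

$$\begin{aligned}
f^* - f^{0,*}
&= - \frac{\left\langle \frac{\partial}{\partial f} l^{NA} \big|_{f=f^{0,*}} \right\rangle}
{\left\langle \frac{\partial^2}{\partial f^2} l^{A} \big|_{f=f^{0,*}} \right\rangle + \left\langle \frac{\partial^2}{\partial f^2} l^{NA} \big|_{f=f^{0,*}} \right\rangle} \\
&= - \frac{\left\langle \frac{\partial}{\partial f} l^{NA} \big|_{f=f^{0,*}} \right\rangle}
{\left\langle \bigl( \frac{\partial}{\partial f} l^{A} \big|_{f=f^{0,*}} \bigr)^2 \right\rangle + \left\langle \frac{\partial^2}{\partial f^2} l^{NA} \big|_{f=f^{0,*}} \right\rangle},
\end{aligned}$$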
Figure 10: Numerical evaluation of the full Bayesian 
risk, for example by Monte Carlo methods, is technically 
the same as an empirical risk minimization (i.e. a 
use of the plug-in principle) for virtual data generated 
by $f$. (One might also call this a virtual (empirical) 
risk; however, it is a real empirical risk if state $f$ is prepared 
as a mixture of $f^0$ according to $p(f^0|f)$.) A saddle 
point approximation of the full integral gives the MaP 
problem, which requires finding the state $f^{0,*}$ with maximal 
posterior probability. If the prior log-probability 
$L^0$ contains integrals (from nonlocal data) this may require 
a numerical evaluation of $L$, i.e. use of the plug-in 
principle. In the 'classical' model with uniform priors already 
implemented in $F^0 = F$ and approximation loss, 
empirical risk minimization using the sampled data is 
equivalent to finding $f^0$ with maximal posterior probability. 
The bounds of the theory of uniform convergence 
require sampling according to the relevant distribution 
$p(q)$, which, however, can be arbitrary. For approximation 
loss, i.e. $l = -c_1 L^D - c_0$, the MiR step is not 
necessary. 



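where $\langle \cdots \rangle$ stands for the $y$ and $q$ integrals and we 
used $\int dy\, e^{L} = 1$ to get the second line, and therefore for 
$c_1 = 1$ we have $\int dy\, e^{L} \frac{\partial}{\partial f} l^A = 0$ and 
$\int dy\, e^{L} \bigl( \frac{\partial^2}{\partial f^2} l^A - ( \frac{\partial}{\partial f} l^A )^2 \bigr) = 0$ 
at the location $f = f^{0,*}$. 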
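
The quality of this second-order result can be checked numerically on a toy problem. The sketch below assumes a single question, a Gaussian $p(y|q,f^{0,*})$ with mean $m = f^{0,*}$ and width $\sigma$, the approximation loss $l^A = (y - f)^2 / (2\sigma^2)$ and a hypothetical non-approximation cost $l^{NA} = c f^4$ with small $c$. It compares the perturbative shift with the exact minimizer of the expected loss, found by bisection. The cost, the parameter values and the function names are assumptions for illustration only.

```python
def perturbative_shift(m, sigma, c):
    """Second-order estimate of f* - f0* for l^A = (y - f)^2 / (2 sigma^2), l^NA = c f^4."""
    d_lna = 4.0 * c * m ** 3    # <d l^NA / df> at f = m
    d2_la = 1.0 / sigma ** 2    # <d^2 l^A / df^2>, equal to <(d l^A / df)^2> here
    d2_lna = 12.0 * c * m ** 2  # <d^2 l^NA / df^2> at f = m
    return -d_lna / (d2_la + d2_lna)

def exact_minimizer(m, sigma, c, lo=-10.0, hi=10.0, iters=200):
    """Bisection on the derivative of <l>(f) = ((f - m)^2 + sigma^2) / (2 sigma^2) + c f^4.
    The derivative is increasing in f for c >= 0, so the root in [lo, hi] is unique."""
    def grad(f):
        return (f - m) / sigma ** 2 + 4.0 * c * f ** 3
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

m, sigma, c = 1.0, 1.0, 0.01
approx = m + perturbative_shift(m, sigma, c)
exact = exact_minimizer(m, sigma, c)
print(approx, exact, abs(approx - exact))
```

For a quadratic cost the expansion terminates and the formula is exact. For the quartic cost used here the two results agree only up to higher orders in the shift, which is what one expects from a perturbative treatment.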
8.6 Occam's Razor 

Here is a good opportunity to discuss the celebrated prin- 
ciple of Occam's razor. Occam's razor states: If two the- 
ories explain the same phenomena equally well choose 
the simpler one. From a Bayesian point of view this 
simply corresponds to including complexity (prior) costs 
in the decision, but is sometimes also interpreted in the 
version: Simpler theories have higher prior probabilities. 

This can be explained by the fact that empirical risk 
minimization has no possibility to include $f$-dependent 
complexity in the form of (prior) costs $l^0(f)$ independent of 
the priors $L^0(f^0)$. For uniform priors corresponding to 
uniform costs (case A) Occam's razor is automatically 
implemented in both versions. For nonuniform costs a 
second risk minimization step has to be included or, if using 
a one step procedure, Occam's razor has to be implemented 
via the priors. In contrast to the first case, where 
the prior and cost versions of Occam's razor are equivalent, 
in the second case the Frequentist and Bayesian 
interpretations of the extra terms do not coincide. Then 
the prior version of the razor is the Bayesian interpretation 
of what appears for the Frequentists as the cost 
version. Relying on the cost version as the intended one, 
a Bayesian approach approximated by a two step MaP-MiR 
approximation differs from a one-step Frequentist 
approach. 

In principle, within a Bayesian framework the concepts of costs and
priors are independent. 68 A MaP-MiR approximation is possible with
arbitrary costs independent of priors, when such costs are included
in the second risk minimization step and the MaP step is still
justified. For example, a sparsity constraint can come from the
implementation of complexity costs related to computational costs
for nonzero numbers, or from the prior information that the actual
data are produced by a small number of prototypes. The two aspects
cannot be modeled independently using ERM, but both could be taken
into account using the two-step MaP-MiR procedure.

9 Stationarity equations 

9.1 Data for generalized questions 

In this section we study the stationarity equations (or mean field
equations) to find extrema of the log-posterior. There are a variety
of methods to find an extremum of the posterior probability. If the
equations are nonlinear, the extremum usually has to be calculated
by iteration. Gradient based methods and EM (Expectation-Maximization)
related algorithms (Dempster, Laird, Rubin, 1979, Tanner, 1993,
Gelman, Carlin, Stern, Rubin, 1995) are special iteration schemes.
In what we called a classical Bayesian model, with no additional
nonuniform costs and $F^0 = F$, log-posterior maximization (MaP)
already includes risk minimization (MiR). But a Bayesian MaP-MiR
approximation is not restricted to a classical model. Assuming that
the MaP step yields a good approximation of the Bayesian integral,
performing a second independent risk minimization step after
maximizing the posterior allows

1. states $f^0$ modeled independently from actions $f$,
that is $F^0 \neq F$,

2. arbitrary costs $l^0$ independent of priors $L^0$,

3. an inverse setting with states defined by $p(y|q, f^0)$
and inverse actions $p(q|y, f)$.

Now we discuss in more detail the case of Gaussian basis questions
and states $f^0$ parameterized by their regression functions.

68 There are justifications of a relation between complexity
and priors also for the Bayesian framework (MacKay, 1992a).
The idea is that complex states usually appear in many more
variations than simple ones, and if their total probability is
fixed their individual probabilities become small. Such a coupling
between prior and complexity results from using uniform priors for
specifically selected groups of states, and a grouping may be seen
as more or less natural for specific situations. The problem of
defining uniform priors in such a hierarchical situation is similar
to the situation for continuous variables, where uniform priors do
not remain uniform under general transformations.

Let us now have a sharper look at the MaP approximation in the case
of nonlocal questions. We calculate the functional derivatives
separately for different classes of questions. We choose
one-dimensional Gaussian distributed basis questions $x$,

$$p(y|x, f^0) \propto e^{-\frac{1}{2}\left(\frac{y - y_x}{\sigma_x}\right)^2}.$$

Thus in this section we assume the $f^0$ to be parameterized by
their regression functions $y_x = y_x(f^0)$. 69 We get

$$\frac{\partial}{\partial y_x} L(y|x', f^0) \propto \delta(x - x')\, \sigma_x^{-2}\, (y - y_x).$$

For general local Gaussian questions including differing variances

$$p(y|q_x, f^0) \propto e^{-\frac{1}{2}\left(\frac{y - y_{q_x}(y_x)}{\sigma_{q_x}}\right)^2},$$

with $y_{q_x}(y_x)$ being a deterministic function of $y_x$, we have

$$\frac{\partial}{\partial y_x} L(y|q_{x'}, f^0) \propto \delta(x - x')\, \sigma_{q_x}^{-2}\, (y - y_{q_x})\, \frac{\partial}{\partial y_x} y_{q_x}(y_x),$$
dyx dyx 
69 We do not discuss here the mathematical difficulties related to
the definition of functional integrals (see for example Glimm &
Jaffe, 1987). In numerical calculations we discretize all functions,
so integrals are replaced by sums. In the language of field theory
we use a sharp ultraviolet cutoff. See for example Bialek, Callan,
Strong, 1996, for a discussion of the continuum limit in density
estimation. Wahba, 1983, shows the paradox that the expectation of
$L^0 = -(1/2) \int dx\, (d^2 y/dx^2)^2$ under $p(y) = e^{L^0}$ is
infinite (see also Green & Silverman, 1994). The functional integral
$\int dy\, e^{-\frac{1}{2}<y|O|y>}$ alone is, according to the
formula for Gaussian integrals, formally $(\det(O)/(2\pi))^{-1/2}$.
For $O = -(d^2/dx^2) + m^2$, which possesses a continuous spectrum,
this determinant cannot be defined (see e.g. Roepsdorff, 1991). In
field theory renormalization group methods are used to find
meaningful continuum limits. (There is a huge literature about
renormalization. See for example Zinn-Justin, 1989, Itzykson &
Drouffe, 1989, Le Bellac, 1991, Fernandez, Frohlich, & Sokal, 1992,
Binney, Dowrick, Fisher, & Newman, 1992, and references therein.)



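and for nonlocal Gaussian questions like the usual smoothness
questions

$$p(y|q, f^0) \propto e^{-\frac{1}{2}\left(\frac{y - y_q[y_x]}{\sigma_q}\right)^2},$$

with $y_q[y_x]$ denoting a deterministic functional depending on
the set $\{y_x\}$,

$$\frac{\partial}{\partial y_x} L(y|q, f^0) \propto \sigma_q^{-2}\, (y - y_q)\, \frac{\partial}{\partial y_x} y_q[y_x].$$

For general questions we write for the log-posterior $L = \ln p$
and find

$$\frac{\partial}{\partial y_x} L(y|q, f^0) = \frac{\partial}{\partial y_x} \ln p(y|q, f^0) = \frac{\frac{\partial}{\partial y_x} p(y|q, f^0)}{p(y|q, f^0)}.$$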

Remembering p(y\x,f°) = n { p(yi\xi, f°), f dx f = 

Y,dlUt dx i (d \ and J d y = Y,dlYlt d y ( i d) the deriva- 
tive of the probability is found as 

^P(^\q,f) = ^jdx f jdyp(y\x f ,f)p(y^x f \y,q) 
= dx' dy 1^2 a ~^d)H x i ~ x )(Vi -Vx)\ 

xp(yW>f)p(y q >x'\y><i) 

for Gaussian $p(y|x, f^0)$. Not only for questions without input
noise but also for $d = 1$, including possible input noise, the
$x'$-integration vanishes. In the latter case this gives

$$\int dy\, (y - y_x)\, p(y|x, f^0)\, p(y_q|x, y, q).$$



The general stationarity conditions are obtained by setting the
functional derivatives of the total log-posterior with respect to
the $y_x$, which parameterize the $f^0$, to zero

where $n$ is the number of training data $D = \{(q_i, y_i)\}$.
The $q_i$ can be general nonlocal questions, and may for example be
written in terms of distances to templates $T_x$. For the sake of
simplicity we will in this case use the term 'data' for discrete
templates, i.e. those defined only for a discrete set of $x$, and
will refer to templates defined for a continuous set $X$ simply as
templates. We now look at some examples.

9.2 Local quadratic templates 

We study a situation where the log-posterior is a sum of quadratic
terms, each of them with templates $T_x$ depending on only one $x$.
The standard case of training examples consisting only of local
Gaussian basis questions has

$$L_D = \sum_i^n L(y_i|x_i, f^0).$$

The corresponding stationarity conditions are, for $\sigma_x = 1$,

$$0 = \sum_i^n \delta(x - x_i)\,(y_{i,x} - y_x) = \sum_i^{n_x} (y_{i,x} - y_x),$$

with $n_x$ the number of times $x$ is in $D_q$, $y_{i,x}$ an answer
to question $x$, and $\sum_i^{n_x} = 0$ for $x$ not in the data.
For $x \in D_q$ we find for unrestricted $F^0$ the well-known mean
square solution

$$y_x = \frac{1}{n_x} \sum_i^{n_x} y_{i,x}.$$

The $y_x$ for $x \notin D_q$ are arbitrary.
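As a minimal numerical sketch of the mean square solution above (the data values and the helper name `local_mean_solution` are illustrative assumptions, not part of the paper), the stationary $y_x$ at each sampled $x$ is simply the average of the answers recorded for that $x$:

```python
from collections import defaultdict

def local_mean_solution(data):
    """Return {x: y_x} with y_x the mean of all answers y_i recorded for x.

    `data` is a list of (x_i, y_i) pairs; x values missing from the data
    get no entry, mirroring the fact that their y_x stays arbitrary.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in data:
        sums[x] += y
        counts[x] += 1          # n_x: how often x occurs in the data
    return {x: sums[x] / counts[x] for x in sums}

# Hypothetical training data: x = 0 is sampled twice, x = 3 once.
D = [(0, 1.0), (0, 3.0), (3, -2.0)]
y_star = local_mean_solution(D)
```

For each sampled $x$ the residuals $y_{i,x} - y_x$ then sum to zero, which is the stationarity condition stated above.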

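Including other local Gaussian questions

$$p(y_{q_x}|q_x, f^0) \propto e^{-\frac{1}{2}\left(\frac{y - y_{q_x}}{\sigma_{q_x}}\right)^2},$$

with $y_{q_x} = y_{q_x}(y_x)$ a function of one $y_x$ only, we find

$$0 = \sum_i^{n_x} \sigma_x^{-2}\,(y_{i,x} - y_x) + \sum_{q_x} \sum_i^{n_{q_x}} \sigma_{q_x}^{-2}\,(y_{i,q_x} - y_{q_x})\, \frac{\partial y_{q_x}}{\partial y_x},$$

equivalent to

$$y_x = \frac{\sum_i^{n_x} \sigma_x^{-2}\, y_{i,x} + \sum_{q_x} \sum_i^{n_{q_x}} \sigma_{q_x}^{-2}\, y_{i,q_x}\, \frac{\partial y_{q_x}}{\partial y_x}}{n_x \sigma_x^{-2} + \sum_{q_x} n_{q_x} \sigma_{q_x}^{-2}\, \frac{y_{q_x}}{y_x}\, \frac{\partial y_{q_x}}{\partial y_x}}.$$

For linear $y_{q_x} = a_{q_x} y_x + b_{q_x}$ this is a linear
equation with solution

$$y_x = \frac{\sum_{q_x} a_{q_x} \sigma_{q_x}^{-2} \sum_i^{n_{q_x}} (y_{i,q_x} - b_{q_x})}{\sum_{q_x} a_{q_x}^2\, n_{q_x}\, \sigma_{q_x}^{-2}},$$

where we simplified the formula by including the $x$ into the $q_x$
with $a_x = 1$ and $b_x = 0$. This gives the reweighting to be done
for varying variances and scaling, and the correction for varying
bias.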
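The reweighting can be made concrete with a short sketch. It assumes linear local questions with mean $a\,y_x + b$ and standard deviation $\sigma$; the function name `combine_linear_questions` and the numbers are hypothetical and only serve as an illustration:

```python
import numpy as np

def combine_linear_questions(questions):
    """Solve the linear stationarity equation for a single y_x.

    Each entry of `questions` is (a, b, sigma, answers) describing a local
    question with mean a * y_x + b, standard deviation sigma, and a list of
    observed answers. The direct question x itself is the case a=1, b=0.
    """
    num = 0.0
    den = 0.0
    for a, b, sigma, answers in questions:
        answers = np.asarray(answers, dtype=float)
        num += a * sigma**-2 * np.sum(answers - b)   # bias-corrected, weighted answers
        den += a**2 * len(answers) * sigma**-2       # weight from scaling and variance
    return num / den

# Hypothetical example: y_x = 2 observed directly and via 3 * y_x + 1.
questions = [
    (1.0, 0.0, 1.0, [2.0, 2.0]),   # direct answers
    (3.0, 1.0, 0.5, [7.0]),        # rescaled and shifted answers
]
y_x = combine_linear_questions(questions)
```

Answers to a question with larger scaling or smaller variance get a larger weight, and the bias $b$ is subtracted before averaging.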

In general, sums of local quadratic terms

$$-\frac{1}{2} \sum_i \int dx\, \Lambda^i_x\, \|y_x - T^i_x\|^2 = -\frac{1}{2} \sum_i <y - T^i|\Lambda^i|y - T^i> \qquad (22)$$

$$= -\sum_i \left( \frac{1}{2} <y|\Lambda^i|y> - <y|\Lambda^i|T^i> \right) + c$$

give linear stationarity equations. Here $c$ is an
$f^0$-independent constant, the $\Lambda^i$ are nonnegative diagonal
matrices with matrix elements
$\Lambda^i_{x,x'} = \delta(x - x')\Lambda^i_x \ge 0$, and we assumed
real scalar products. We define the projector $\mathcal{P}^i$ into
the $T^i$-space spanned by the $x$ with nonzero $\Lambda^i_x$. Then
$\Lambda^i$ commutes with this projector,
$\Lambda^i = \mathcal{P}^i \Lambda^i = \Lambda^i \mathcal{P}^i$,
and without loss of generality the template is understood to be
restricted to $T^i$-space, meaning $T^i = \mathcal{P}^i T^i$.

Sums of terms for several $T^i$ can be combined. To formulate this
in general we define

$$\mathcal{N} = \sum_i \mathcal{P}^i,$$

with diagonal elements $\mathcal{N}(x,x)$ giving the number of
templates active for $x$, and

$$\Lambda = \sum_i \Lambda^i.$$

The total projector $\mathcal{P}$ for $\sum_i T^i$ has two
contributions, with the second depending on the pairwise overlaps of
the spaces defined by the $\mathcal{P}^i$ (compare the AND for
probabilities). For the example of two templates this reads

$$\mathcal{P} = \mathcal{P}^1 + \mathcal{P}^2 - \mathcal{P}^1 \mathcal{P}^2.$$

The operators $\mathcal{N}$ and $\Lambda$ can be inverted in the
space where their diagonal matrix elements are nonzero, and we write
for the inverse in that subspace

$$\Lambda^{-1} = \mathcal{P} (\mathcal{P} \Lambda \mathcal{P})^{-1} \mathcal{P}.$$

Introducing the ($\Lambda$-weighted) sum

$$T_\Lambda = \sum_i \Lambda^i T^i$$

(with special case $T = \sum_i T^i$), the mean (over $i$, not $x$)
of a set of templates $T^i$ can be written as

$$\bar{T}_\Lambda = \mathcal{P} \bar{T}_\Lambda = \Lambda^{-1} T_\Lambda$$

(special case $\bar{T} = \mathcal{N}^{-1} \sum_i T^i$), and we have
for the sum

$$\sum_i <y - T^i|\Lambda^i|y - T^i> = <y - \bar{T}_\Lambda|\Lambda|y - \bar{T}_\Lambda> + \sum_i <T^i|\Lambda^i|T^i> - <\bar{T}_\Lambda|\Lambda|\bar{T}_\Lambda>.$$

The difference

$$\sum_i^m <T^i|\Lambda^i|T^i> - <\bar{T}_\Lambda|\Lambda|\bar{T}_\Lambda> = m\, \mathrm{VAR}_\Lambda(\{T^i\}),$$

proportional to a variance, is $y$-independent and therefore
irrelevant for the derivative if it only appears as an additive term
in $L$.

9.3 Linear regularization

Many smoothness and symmetry functionals are examples of Gaussian
nonlocal questions. Take as an example a functional with $L^0$ of
the form

$$L^0(f^0) - c = -\frac{1}{2} \int dx\, dx'\, y_x\, O_{x,x'}\, y_{x'} = -\frac{1}{2} <y|O|y>, \qquad (23)$$

with a real symmetric positive (semi-)definite $O$ representing a
Gaussian probability 70 and a constant $c$ ensuring correct
normalization. As we use angle brackets $<\cdot|\cdot>$ for a scalar
product in a Hilbert space of functions $y$, we denote
$O_{x,x'} = O(x,x') = <x|O|x'>$ also by angle brackets. This
notation can be used for any hermitian linear operator $O$. Here the
term $\lambda <y|O|y>$ is a quadratic regularization term in the
sense of Tikhonov, and if the spaces
$F^0_c = \{f^0 \,|\, \lambda <y(f^0)|O|y(f^0)> < c\}$ are compact
for real $c$, then in the limit $\lambda \to 0$ asymptotic stability
conditions hold (Tikhonov, 1963, Vapnik, 1982). Here we are not
especially interested in asymptotic results (except if necessary to
ensure the validity of the saddle point approximation), but for
continuous $x$ we will refer to the case with a quadratic
regularization functional, and therefore a linear stationarity
equation, as linear regularization.

A matrix element

$$<\mathcal{D}y|\mathcal{D}y> = \|\mathcal{D}y\|^2 = <y|\mathcal{D}^\dagger \mathcal{D}|y>,$$

with $\mathcal{D}^\dagger$ denoting the adjoint of $\mathcal{D}$,
gives an operator $O = \mathcal{D}^\dagger \mathcal{D}$ on the
domain where the operators are defined. To construct a smoothness
functional for square integrable functions, $\mathcal{D}$ can be
chosen as a hermitian linear differential operator. A first order
example is 71

$$\mathcal{D}_{x,x'} = \mathcal{D}(x, x') = i\delta'(x - x') = -i\delta(x - x')\frac{d}{dx},$$

giving

$$O_{x,x'} = O(x, x') = -\delta''(x - x') = -\delta(x - x')\frac{d^2}{dx^2}.$$

70 Thus, $O$ is an inverse covariance operator, $C = O^{-1}$. As
matrix elements of an inverse operator are also called Green's
functions, $G(x,x') = C(x,x') = O^{-1}(x,x')$ are the Green's
functions fulfilling $OC = 1$. If the operator has the form
$\lambda I - O$, then $G$ are the matrix elements (kernel) of the
resolvent operator. Usually the resolvent is seen as a function of a
complex $\lambda$ with poles at the eigenvalues and a cut at the
continuous spectrum of $O$. See for example chapter 7 in Glimm &
Jaffe, 1987, and any book on functional analysis.



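Instead of giving a template for $y_x$ we could also give templates
for other functions of $y_x$. For example, a template $T'$ for
$\mathcal{D}y$ can be written

$$L^0(f^0) = -\frac{1}{2} <\mathcal{D}y - T'|\Lambda_{T'}|\mathcal{D}y - T'>$$

$$= -\frac{1}{2} <y|\mathcal{D}^\dagger \Lambda_{T'} \mathcal{D}|y> + <y|\mathcal{D}^\dagger \Lambda_{T'} T'> + c,$$

for real scalar products and $c$ an $f^0$-independent constant.
With $O = \mathcal{D}^\dagger \Lambda_{T'} \mathcal{D}$ and
$\tilde{T} = \mathcal{D}^\dagger \Lambda_{T'} T'$ this reads

$$L^0(f^0) = -\frac{1}{2} <y|O|y> + <y|\tilde{T}> + c.$$

We may express $T'$ also by $\mathcal{D}$ and write
$T' = \mathcal{D}T$, so that

$$L^0(f^0) = -\frac{1}{2} <\mathcal{D}y - \mathcal{D}T|\Lambda_{T'}|\mathcal{D}y - \mathcal{D}T> = -\frac{1}{2} <y - T|O|y - T> = -\frac{1}{2} \|y - T\|^2_O.$$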
71 Notice that this formal notation does not mean the operators are
diagonal in the $x$-representation. The $\delta$-function only
restricts the derivatives to the location $x = x'$, but the
derivatives themselves depend also on the neighborhood of $x$. This
is most easily seen by replacing the derivatives with a finite
difference approximation. For the operator $\mathcal{D}$ to be
hermitian on a function space, boundary terms have to vanish, as can
be checked using partial integration. This is fulfilled e.g. for
periodic functions on their periodicity interval or for functions
which vanish asymptotically.
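The finite difference argument of this footnote can be checked numerically. The following is only a sketch under stated assumptions: a periodic grid on $[0, 2\pi)$, a central-difference discretization, and the sign convention $\mathcal{D} = -i\,d/dx$; the grid size is an arbitrary choice.

```python
import numpy as np

n = 200                      # number of grid points (assumption)
h = 2 * np.pi / n            # grid spacing on the periodic interval [0, 2*pi)
x = h * np.arange(n)

# Central-difference matrix with periodic boundary conditions.
shift_fwd = np.roll(np.eye(n), 1, axis=1)    # (shift_fwd @ y)[k] = y[k+1]
shift_bwd = np.roll(np.eye(n), -1, axis=1)   # (shift_bwd @ y)[k] = y[k-1]
C = (shift_fwd - shift_bwd) / (2 * h)

D_op = -1j * C                       # discretized first-order operator
O = (D_op.conj().T @ D_op).real      # O = D^dagger D, a discretized -d^2/dx^2

y = np.sin(x)
Oy = O @ y                   # approximately equal to y, since -y'' = y
```

With periodic boundaries the discretized operator is hermitian, $O$ is symmetric and positive semi-definite, and $O$ is not a diagonal matrix: each row couples a grid point to its neighborhood.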



Examples include distances in Sobolev spaces, where O 
consists of a sum over derivatives. A log-posterior of the 
form (23), having no term linear in y x , corresponds to 
a 'null' template, and no inhomogeneities appear in the 
corresponding derivative. 

The normalization constant for a $d$-dimensional $y$ in the presence
of linear terms is calculated according to

$$\int \prod_i^d dy_i\; e^{-\frac{1}{2}<y|O|y> + <y|\tilde{T}>} = (2\pi)^{\frac{d}{2}} (\det O)^{-\frac{1}{2}}\, e^{\frac{1}{2}<\tilde{T}|O^{-1}|\tilde{T}>}.$$

We define analogously to the previous subsection

$$O = \sum_i O^i,$$

and, with invertibility in the space defined by a projector
$\mathcal{P}$,

$$O^{-1} = \mathcal{P} (\mathcal{P} O \mathcal{P})^{-1} \mathcal{P},$$

$$T_O = \sum_i O^i T^i,$$

and

$$\bar{T}^O = O^{-1} T_O.$$

The projected equation may be solved, for example, with the pseudo
inverse. Also, $O$ may be extended so that its inverse exists in the
whole space. One may, for example, add the identity on the zero
space, add a mass term $m^2 I$, or impose boundary conditions.

Then, like in the local case, we have also for nondiagonal $O^i$,
assumed to be real symmetric positive (semi-)definite, for a sum of
quadratic terms

$$\Delta_{O,T} = \sum_i <y - T^i|O^i|y - T^i>$$

$$= <y - \bar{T}^O|O|y - \bar{T}^O> + \sum_i <T^i|O^i|T^i> - <\bar{T}^O|O|\bar{T}^O>$$

$$= <y - \bar{T}^O|O|y - \bar{T}^O> + E_O,$$

with minimum ("ground state")

$$E_O = \sum_i <T^i|O^i|T^i> - <\bar{T}^O|O|\bar{T}^O>$$

at

$$y = \mathrm{argmin}\, \Delta_{O,T} = \bar{T}^O,$$

in the space on which $\mathcal{P}$ projects. Thus, we can call
$\bar{T}^O$ the template average of the set of $T^i$ with respect to
the norm induced by the $O^i$. The standard average is a special
case. For $O = O^1 + O^2$ one has

$$\bar{T}^O - T_1 = O^{-1} O^2 (T_2 - T_1), \qquad \bar{T}^O - T_2 = O^{-1} O^1 (T_1 - T_2).$$

For equal operators $O^1 = O^2$ this gives
$\bar{T}^O - T_1 = -(\bar{T}^O - T_2)$, so that, like for a standard
mean, the template average of two templates lies in the 'middle'
between the two. That means, $\bar{T}^O$ then has equal
$O$-distance from both templates,

$$<\bar{T}^O - T_1|O|\bar{T}^O - T_1> = <\bar{T}^O - T_2|O|\bar{T}^O - T_2>,$$

with the minus sign disappearing in a quadratic form. For unequal
operators the average is shifted toward the template with the
larger weight.
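The decomposition of a sum of quadratic template terms into one quadratic term around the template average plus a constant can be verified numerically. The sketch below is an illustration only: the random matrices stand in for two operators $O^1$, $O^2$, the vectors for two templates, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

def random_spd(dim):
    """Random symmetric positive definite matrix (stand-in for an O^i)."""
    A = rng.normal(size=(dim, dim))
    return A @ A.T + dim * np.eye(dim)

def quad(u, M):
    """Quadratic form <u|M|u>."""
    return u @ M @ u

O1, O2 = random_spd(d), random_spd(d)
T1, T2 = rng.normal(size=d), rng.normal(size=d)

O = O1 + O2
T_bar = np.linalg.solve(O, O1 @ T1 + O2 @ T2)        # template average
E_O = quad(T1, O1) + quad(T2, O2) - quad(T_bar, O)   # ground state value

def delta(y):
    """Sum of the two quadratic template terms."""
    return quad(y - T1, O1) + quad(y - T2, O2)

y = rng.normal(size=d)
decomposed = quad(y - T_bar, O) + E_O

# Equal operators: the template average is the ordinary midpoint.
T_mid = np.linalg.solve(2 * O1, O1 @ T1 + O1 @ T2)
```

Here `delta(y)` and `decomposed` agree for any `y`, the minimum is attained at `T_bar`, and in the equal-operator case `T_mid` is equidistant from both templates.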

The functional derivative for the sum of a nonlocal quadratic $L^0$
as in (23) and a local quadratic term is

$$\frac{\partial L(y)}{\partial y_x} = -\int dx'\, O_{x,x'}\, y_{x'} - \Lambda_T(x)\,(y_x - T_x)$$

$$= -<x|O|y> - <x|\Lambda_T|y> + <x|\Lambda_T|T>.$$

Mean square error terms are special examples of such terms and
therefore encompassed by this formulation. The variable $x$ can be
multi-dimensional, assuming the vector is written in a basis where
$\Lambda_T$ is diagonal. (The situation where for one $x$ several
different $T_x$ are available will be discussed below.) For
invertible $\Lambda_T + O$ the stationarity condition reads

$$y = (\Lambda_T + O)^{-1} \Lambda_T T, \qquad (24)$$

which is for a linear operator $O$ a linear inhomogeneous equation.
If $\Lambda_T + O$ is not invertible in the full space, components
of the null space, i.e. solutions of $(\Lambda_T + O)y = 0$, can be
added to a special solution. In cases where $O^{-1}$ can be
calculated, it can be useful to rewrite Eq.(24) by separating the
parts invariant under projection with $\mathcal{P}_T$ from those
which are not:

$$Oy = \Lambda_T (T - y).$$

Then the vector

$$a = \Lambda_T (T - y) = \mathcal{P}_T a,$$

being invariant under projection with $\mathcal{P}_T$, can be
calculated purely within the $T$-space. If $O$ is invertible, the
unknown $y$ in the definition of $a$ can be eliminated using
$y = O^{-1}a$, giving for $a$ the equation

$$(I + \Lambda_T O^{-1})\, a = \Lambda_T T, \qquad (25)$$

with $I$ denoting the identity. Using
$\mathcal{P}_T \Lambda_T = \Lambda_T \mathcal{P}_T$,
$a = \mathcal{P}_T a$ and $T = \mathcal{P}_T T$ to insert the
projector $\mathcal{P}_T$ gives

$$\left( (\mathcal{P}_T \Lambda_T \mathcal{P}_T)^{-1} + (\mathcal{P}_T O^{-1} \mathcal{P}_T) \right) a = \mathcal{P}_T T = T.$$

Hence, only matrix elements of
$(\mathcal{P}_T O^{-1} \mathcal{P}_T)$ within the $T$-space are
required to solve for $a$. We now consider in more detail the cases
of zero and nonzero quadratic templates.
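The equivalence of the full-space solution and the solution restricted to the $T$-space can be illustrated on a grid. This is a sketch under assumptions not made in the text: a periodic discretization of $O = -d^2/dx^2 + m^2$ with unit grid spacing, a template given on five arbitrarily chosen grid points, and random positive weights.

```python
import numpy as np

n = 40
rng = np.random.default_rng(1)

# Discretized O = -d^2/dx^2 + m^2 with periodic boundaries (invertible for m^2 > 0).
I = np.eye(n)
lap = np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1) - 2 * I
O = -lap + 0.5 * I

# Template given only on a subset of x (the "T-space"); zero weight elsewhere.
idx = np.array([3, 10, 11, 25, 36])
lam = np.zeros(n)
lam[idx] = rng.uniform(0.5, 2.0, size=idx.size)
Lambda_T = np.diag(lam)
T = np.zeros(n)
T[idx] = rng.normal(size=idx.size)

# Full-space stationarity condition: (Lambda_T + O) y = Lambda_T T.
y_full = np.linalg.solve(Lambda_T + O, Lambda_T @ T)

# Projected form: solve only within the T-space, then expand in Green's functions.
G = np.linalg.inv(O)
G_TT = G[np.ix_(idx, idx)]
a_T = np.linalg.solve(np.diag(1.0 / lam[idx]) + G_TT, T[idx])
y_green = G[:, idx] @ a_T
```

The second route only needs a 5 x 5 linear solve, independent of the grid size, once the Green's function $G = O^{-1}$ is known.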

9.3.1 Homogeneous linear regularization 

Here we consider the special, but most common, case of discrete
Gaussian data with equal variance and quadratic nonlocal terms,
e.g. smoothness, which can be seen as corresponding to a
zero-template $T_x = 0$, $\forall x$. Gaussian data terms with
$\sigma^2 = 1$ give

$$\frac{\partial (L_D + L^0)}{\partial y_x} = \sum_i^{n_x} (y_{i,x} - y_x) - \lambda \int dx'\, O_{x,x'}\, y_{x'}.$$

The stationarity condition reads

$$(\mathcal{N}_D + \lambda O)\, y = D. \qquad (26)$$

Vector $D$ has components

$$D_x = \sum_i^n \delta(x - x_i)\, y_{i,x} = \sum_i^{n_x} y_{i,x},$$

and the operator $\mathcal{N}_D$ has matrix elements

$$\mathcal{N}^D_{x,x'} = \mathcal{N}_D(x,x') = \delta(x - x') \sum_i^n \delta(x - x_i) = \delta(x - x')\, n_x,$$

giving the number of times a specific $x$ appears in the data. We
define the projector $\mathcal{P}_D$, projecting into the space
spanned by the $x$ included in the data, by its matrix elements

$$\mathcal{P}^D_{x,x'} = \mathcal{P}_D(x,x') = \delta(x - x')\, \Theta\!\left( \sum_i^n \delta(x - x_i) \right),$$

with the step function $\Theta(x)$ restricting the matrix elements
to zero or one. Its number of nonzero diagonal elements, i.e.
$\tilde{n} = \mathrm{Tr}\, \mathcal{P}_D = n - \sum_x \sum_{i=2}^{n_x} 1$,
is the number of different $x = x_i$ in the data. Then
$D = \mathcal{P}_D D$ and
$\mathcal{N}_D = \mathcal{P}_D \mathcal{N}_D \mathcal{P}_D$, with
the operator $\mathcal{N}_D$ being equal to $\mathcal{P}_D$, and
therefore an identity in that subspace, if $n_x = 1$ for all $x$.
This is the usual case for i.i.d. sampling of continuous $x$. In a
space where $(\mathcal{N}_D + \lambda O)$ is invertible the linear,
inhomogeneous (e.g. integro-differential) equation (26) has the
solution

$$y = (\mathcal{N}_D + \lambda O)^{-1} D,$$

being a special case of Eq.(24) with $D/\lambda = \Lambda_T T$ and
$\mathcal{N}_D/\lambda = \Lambda_T$. Components of a null space may
be added. The matrix elements $O^{-1}_{x,x'} = G(x,x')$ (or Green's
function) satisfy by definition

$$O\, G(x,x') = \delta(x - x').$$

For some $O$ the Green's function can be calculated analytically.
Then the solution of the resulting equation

$$y = \frac{1}{\lambda} O^{-1} (D - \mathcal{N}_D y) = \frac{1}{\lambda} O^{-1} \mathcal{N}_D (\bar{D} - y), \qquad (27)$$

(with $\bar{D} = \mathcal{N}_D^{-1} D$) or in components

$$y_x = \frac{1}{\lambda} \sum_i^{\tilde{n}} G(x, x_i) \left( \sum_j^{n_{x_i}} y_{j,x_i} - n_{x_i}\, y_{x_i} \right) = \sum_i^{\tilde{n}} a_i\, G(x, x_i),$$

is for fixed $x$ in an $\tilde{n}$-dimensional space spanned by the
known $G(x, x_i)$ with different $x_i$. In the vector

$$a = \frac{(D - \mathcal{N}_D y)}{\lambda}$$

with components

$$a_x = \frac{\sum_j^{n_x} y_{j,x} - n_x\, y_x}{\lambda}$$

only the components $a_{x_i} = a_i$ for $x_i$ belonging to the data
are not equal to zero. Inserting $y = O^{-1} a$ into the definition
of $a$ gives an $\tilde{n}$-dimensional matrix equation

$$D_{x_i} = n_{x_i} \sum_j^{\tilde{n}} G(x_i, x_j)\, a_j + \lambda a_i,$$

or in operator notation

$$D = \left( \mathcal{N}_D (\mathcal{P}_D O^{-1} \mathcal{P}_D) + \lambda I \right) a. \qquad (28)$$

The identity $I$ commutes with $\mathcal{P}_D$, and
$\mathcal{P}_D O^{-1} \mathcal{P}_D$ has the matrix elements
$G(x_i, x_j)$. This is the equivalent of Eq.(25).

If $O$ has zero modes, then Eq.(27) becomes

$$y = \frac{1}{\lambda} O_1^{-1} (D - \mathcal{N}_D y) + \sum_k b_k u_k,$$

where the $u_k$ represent an orthonormal basis of the zero space and
$O_1$ denotes the restriction of $O$ to the subspace where its
inverse exists. In addition one has in the space of zero modes

$$0 = \mathcal{P}_0 (D - \mathcal{N}_D y),$$

where $\mathcal{P}_0$ denotes the projector into the space of zero
modes, i.e.

$$\mathcal{P}_0(x,x') = \sum_j u_j(x)\, u_j(x').$$

This yields the two data space equations

$$\sum_j^{n_{x_i}} y_{j,x_i} = n_{x_i} \left( \sum_j^{\tilde{n}} G(x_i, x_j)\, a_j + \sum_k^m b_k u_k(x_i) \right) + \lambda a_i,$$

$$\sum_i^{\tilde{n}} u_k(x_i)\, a_i = 0, \quad \forall k.$$
The (pseudo-differential) operator

$$O = \sum_{m=0}^{\infty} (-1)^m \frac{\sigma^{2m}}{m!\, 2^m}\, \nabla^{2m}, \qquad (29)$$

with $\nabla^{2m}$ denoting the $m$-iterated Laplacian, results in a
Gaussian $G(x, x')$. Diagonal $G(x, x')$ correspond to local
questions; radially symmetric (Gaussian) Green's functions are
called Radial Basis Functions. This and more examples, including
various forms of splines, as well as the relation to conditionally
positive definite and completely monotonic functions, can be found
in Poggio & Girosi, 1990, Wahba, 1990, and Girosi, Jones, & Poggio,
1995.

Restricting the terms in the summation to a number smaller than
$n$, corresponding to an additional prior or cost term, can be
combined with an algorithm to determine the optimal selection of
the $x_j$ (e.g. the centers of Gaussians). Including, for example,
the variance $\sigma_x$ for the regularizer as a parameter leads to
nonlinear equations.
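The reduction of the stationarity condition to a small matrix equation in data space can be sketched numerically. The following example rests on assumptions chosen for illustration: a periodic grid with unit spacing, $O = -d^2/dx^2 + m^2$ (so that $O$ has no zero modes), and made-up data in which one $x$ is sampled twice.

```python
import numpy as np

n_grid = 80
lam = 0.5      # regularization weight lambda (assumption)
m2 = 1.0       # mass term making O invertible (assumption)

# Discretized O = -d^2/dx^2 + m^2 on a periodic grid.
I = np.eye(n_grid)
lap = np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1) - 2 * I
O = -lap + m2 * I

# Hypothetical data; x = 5 is sampled twice, so n_x = 2 there.
x_data = np.array([5, 5, 20, 47, 63])
y_data = np.array([1.0, 1.4, -0.5, 2.0, 0.3])

D = np.zeros(n_grid)
np.add.at(D, x_data, y_data)        # D_x = sum of answers at x
n_x = np.zeros(n_grid)
np.add.at(n_x, x_data, 1.0)         # diagonal of N_D
N_D = np.diag(n_x)

# Full-space solution of (N_D + lam * O) y = D.
y_full = np.linalg.solve(N_D + lam * O, D)

# Data-space solution: (N_D G + lam I) a = D on the distinct data points.
G = np.linalg.inv(O)
xs = np.unique(x_data)
G_dd = G[np.ix_(xs, xs)]
a = np.linalg.solve(np.diag(n_x[xs]) @ G_dd + lam * np.eye(len(xs)), D[xs])
y_green = G[:, xs] @ a              # y_x = sum_i a_i G(x, x_i)
```

Only as many coefficients $a_i$ as there are distinct data points have to be determined; the function $y$ on the whole grid is then an expansion in the Green's functions centered at the data.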

9.3.2 Inhomogeneous linear regularization 

Choosing Gaussian data terms with equal variance in addition to a
template $T_x$ we have to minimize

$$\sum_i^n (y_i - y_{x_i})^2 + \lambda_T \int dx\, (y_x - T_x)^2,$$

or for $h_x = y_x - T_x$

$$\sum_i^n (y_i - h_{x_i} - T_{x_i})^2 + \lambda_T \int dx\, (h_x)^2,$$

with a shifted local error term and a penalty for deviation from
zero. For the unshifted parameterization the stationarity condition
reads

$$y_x = \frac{\sum_i^{n_x} y_{i,x} + \lambda_T T_x}{n_x + \lambda_T},$$

which means $y_x = T_x$ for non-data points $x \neq x_i$,
$i = 1, \cdots, n$.

The last example becomes more interesting if combined with a
nonlocal term, for example a differential operator implementing a
smoothness prior. Then one has to minimize

$$\sum_i^n (y_i - y_{x_i})^2 + \lambda_T \int dx\, (y_x - T_x)^2 + \lambda_s \int dx\, dx'\, y_x\, O_{x,x'}\, y_{x'},$$

with stationarity equation

$$\int dx' \left( (n_x + \lambda_T)\, \delta(x - x') + \lambda_s O_{x,x'} \right) y_{x'} = \sum_i^{n_x} y_{i,x} + \lambda_T T_x.$$

For the example $O_{x,x'} = -\delta(x - x')\frac{d^2}{dx^2}$ this
gives the linear inhomogeneous differential equation

$$\left( -\lambda_s \frac{d^2}{dx^2} + n_x + \lambda_T \right) y_x = \sum_i^{n_x} y_{i,x} + \lambda_T T_x.$$

We see that in cases of nonzero templates the stationarity equations
acquire, besides the always present $\delta$-like data terms, also
continuous inhomogeneities. For continuous $x$ we will call the case
where the regularization functional is a sum of a term quadratic in
$y$ and a term linear in $y$ inhomogeneous linear regularization.

Expressing for two templates that functions $f^0$ should be similar
to $T_1$ OR $T_2$ leads to nonlinear equations, which we discuss in
the next section.
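A discretized version of inhomogeneous linear regularization can serve as a sanity check. The sketch below assumes a periodic grid with unit spacing, a sinusoidal template, and made-up data and weights; none of these choices come from the text.

```python
import numpy as np

n = 60
lam_T, lam_s = 0.8, 2.0             # illustrative weights (assumptions)
grid = np.arange(n)
T = np.sin(2 * np.pi * grid / n)    # hypothetical template

# Hypothetical data; x = 4 is sampled twice.
x_data = np.array([4, 4, 30, 51])
y_data = np.array([1.5, 1.1, -0.7, 0.2])
n_x = np.zeros(n)
np.add.at(n_x, x_data, 1.0)
D = np.zeros(n)
np.add.at(D, x_data, y_data)

# Discretized -d^2/dx^2 with periodic boundaries.
I = np.eye(n)
O = -(np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1) - 2 * I)

def solve(smoothness_weight):
    """Solve ((n_x + lam_T) + smoothness_weight * O) y = D + lam_T * T."""
    A = np.diag(n_x + lam_T) + smoothness_weight * O
    return np.linalg.solve(A, D + lam_T * T)

y_local = solve(0.0)      # purely local case: y_x = T_x away from the data
y_smooth = solve(lam_s)   # template, data, and smoothness combined
```

Without the smoothness term the solution equals the template at all non-data points and a weighted mean of data and template at data points; with it, the template enters as a continuous inhomogeneity on the right-hand side.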

9.4 Nonlinear regularization 

Priors which are constructed by combining quadratic subproperties
$d$ using a real valued extension of logic do not need to be
quadratic in the $y_x$, for example, if an OR is implemented in a
soft and not in a hard version. Also one could use a
parameterization of $F^0$ so that $y_x$ is a nonlinear function of
the parameters, 73 or use a template for nonlinear questions, like a
correlation template for $y_x y_{x'}$ in terms like

$$\|y_x y_{x'} - T_{x,x'}\|^2.$$

Let us consider a case with two templates combined by a soft OR. The
two templates could have been obtained by using the human interface
discussed above and represent two typical functions $T^1$ and $T^2$,
at least one of which the actual $y_x(f^0)$ is expected to be
similar to. For example, two such templates can be prototypical
structures for electrocardiograms, or patterns in financial time
series. This situation might, for example, be naively approximated
according to Eqs.(7) by including terms like

$$\left( \int dx\, (y_x - T^1_x)^2 \right) \left( \int dx'\, (y_{x'} - T^2_{x'})^2 \right),$$

assuming a possible bound $m$ has been absorbed into the scaling of
the term. This non-quadratic regularization term has, up to a
constant factor, the functional derivative with respect to
$y_x(f^0)$

$$(y_x - T^1_x) \int dx'\, (y_{x'} - T^2_{x'})^2 + (y_x - T^2_x) \int dx'\, (y_{x'} - T^1_{x'})^2,$$

yielding a nonlinear stationarity equation. We may therefore call
this case nonlinear regularization, which in the case of nonzero
templates is also inhomogeneous.

72 Also for homogeneous linear regularization the stationarity
equations are inhomogeneous, but the regularization functional adds
nothing to the data inhomogeneities of $\delta$-form.

73 Many methods, like for example sigmoidal neural networks, use a
nonlinear parameterization, but not independently for each
$y_x(f^0)$. One may contrast a genuine nonlinearity corresponding to
a model of nature (i.e. $F^0$, $L$) and a nonlinearity induced by
choosing a nonlinear action model ($F$, $l$).
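The functional derivative of such a product term can be checked against finite differences. The sketch below assumes a grid discretization on $[0, 1)$ and two arbitrary templates; it includes the explicit factor 2 that arises from differentiating the squares, and all names are illustrative.

```python
import numpy as np

n = 50
h = 1.0 / n
x = h * np.arange(n)
T1 = np.sin(2 * np.pi * x)       # hypothetical template 1
T2 = np.cos(2 * np.pi * x)       # hypothetical template 2

def soft_or_term(y):
    """Product of the two squared distances (integrals replaced by sums)."""
    d1 = h * np.sum((y - T1) ** 2)
    d2 = h * np.sum((y - T2) ** 2)
    return d1 * d2

def functional_derivative(y):
    """Analytic derivative with respect to y_x, i.e. (1/h) * d/dy[k]."""
    d1 = h * np.sum((y - T1) ** 2)
    d2 = h * np.sum((y - T2) ** 2)
    return 2 * (y - T1) * d2 + 2 * (y - T2) * d1

rng = np.random.default_rng(0)
y = rng.normal(size=n)
eps = 1e-6
numeric = np.array([
    (soft_or_term(y + eps * e) - soft_or_term(y - eps * e)) / (2 * eps * h)
    for e in np.eye(n)
])
```

The derivative is cubic in $y$, so the stationarity equation is nonlinear, and it vanishes at either template, where the product term attains its minimum value of zero.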

9.4.1 Finite temperature or mixture 
regularization 

We can choose the realization (8) for OR, if we assume that the
properties we combine are log-probabilities,

$$L = L_M = \ln p_M = \ln \sum_i^N e^{-\frac{\beta}{2} \left( \sum_{j_i}^{N_i} <y - T^{i j_i}|O^{i j_i}|y - T^{i j_i}> \right) + c_i} - \ln Z, \qquad (30)$$

where the constants $c_i$ can include the logarithm of inverse
normalization factors and weights for component $i$, and

$$Z_i = \int dy\; e^{-\frac{\beta}{2} \left( \sum_{j_i}^{N_i} <y - T^{i j_i}|O^{i j_i}|y - T^{i j_i}> \right) + c_i}, \qquad (31)$$

and $Z = \sum_i Z_i$. This form is general enough to include the
product terms of OR for non-disjunct events if $c_i$ is allowed to
be imaginary, giving a negative factor for the exponential. For
disjunct events the $c_i$ are real, thus all contributions to the
sum are positive and we speak of a mixture model or, for continuous
$x$, of mixture regularization. For mixture models see Everitt &
Hand, 1981, Titterington, Smith, & Makov, 1985, Kontkanen,
Myllymaki, & Tirri, 1997. For every fixed $y$ the log-probability
(30) has the structure of a free energy of a system at temperature
$1/\beta$. Thus, emphasizing the temperature dependence, we will
also speak of a thermic realization of the OR and, accordingly, of
finite temperature regularization. In Section 5.3.4 we defined
energies as ($\beta$-scaled) shifted log-probabilities of elementary
events. Thus, with respect to the set of (disjunct) elementary
events $i = \omega_i \in \Omega$, the exponents define energies

$$E_i = \frac{1}{2} \sum_{j_i}^{N_i} <y - T^{i j_i}|O^{i j_i}|y - T^{i j_i}> - c_i,$$

which are unique up to a factor $\beta$ (inverse temperature) and a
constant. Here the $i$ represent possible, disjunct 'states', and
the function $y$ plays the role of system parameters which we can
adapt to minimize the free energy. Thus, the system is in state
$i_1$ OR $i_2$ OR $\cdots$. The variables $(j_i, x)$ label the
subsystems (e.g. internal or microscopic degrees of freedom, like
single particle coordinates and momenta in a many particle system).
Every state or elementary event $i$ is a complete collection of
states for all subsystems, labeled by $(j_i, x)$, i.e. subsystem
$x_1$ is in state $a_1$ (e.g. $y_{x_1} = a$) AND subsystem $x_2$ is
in state $a_2$ AND $\cdots$. The special form (30) has quadratic
energies ('generalized oscillators' or 'generalized free fields') 74
and is therefore for finite $|X|$ a Gaussian mixture model, or, for
continuous $X$, a mixture of Gaussian processes. The $O^{i j_i}$
define the inverse covariance matrices of the processes.
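For a finite grid, the derivative of a mixture log-prior of this type can be written in terms of component weights $w_i(y)$ and checked numerically. The sketch below assumes one quadratic term per component, symmetric operators, and arbitrary illustrative values for templates, constants $c_i$, and $\beta$; the $y$-independent $\ln Z$ is omitted.

```python
import numpy as np

def exponents(y, templates, operators, c, beta):
    """Exponents -beta/2 <y-T_i|O_i|y-T_i> + c_i of the mixture components."""
    return np.array([
        -0.5 * beta * (y - T) @ O @ (y - T) + ci
        for T, O, ci in zip(templates, operators, c)
    ])

def mixture_log_prior(y, templates, operators, c, beta):
    """ln sum_i exp(exponent_i), up to the y-independent term -ln Z."""
    e = exponents(y, templates, operators, c, beta)
    m = e.max()                          # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(e - m)))

def mixture_gradient(y, templates, operators, c, beta):
    """-beta * sum_i w_i(y) O_i (y - T_i), with w_i the component weights."""
    e = exponents(y, templates, operators, c, beta)
    w = np.exp(e - e.max())
    w /= w.sum()
    grad = np.zeros_like(y)
    for wi, T, O in zip(w, templates, operators):
        grad -= beta * wi * (O @ (y - T))
    return grad

rng = np.random.default_rng(2)
d = 5
templates = [rng.normal(size=d), rng.normal(size=d)]
operators = [np.eye(d), 2.0 * np.eye(d)]   # symmetric, as the gradient formula assumes
c = [0.0, 0.3]
beta = 1.5
y = rng.normal(size=d)
g = mixture_gradient(y, templates, operators, c, beta)
```

Because the weights $w_i(y)$ depend on $y$, setting this gradient to zero gives a nonlinear equation; for large $\beta$ the weights concentrate on the nearest component, which corresponds to the low temperature limit.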

We may remark that in (30) $p(f^0|f)$ is written as a mixture, while
in applications in density estimation a mixture model is often used
for $p(y|f^0)$. In regularization every training data point
$(y_i, x_i)$ gives only one $x$ component of a whole vector $y(x)$.
In a density estimation problem a complete data vector with all its
$x$ components is given, $d = |X|$ is discrete and usually
relatively small, and $x$ is denoted by a discrete index like, for
example, $i$ or $k$. In regularization a ('data') vector corresponds
to a function $y(x)$ given $\forall x$. For continuous $x$ this is a
realization or complete sample path of a stochastic process. In
addition to the finite number of training data, the mixture
components are determined by "continuous data", i.e. templates
$T^{i j_i}(x)$, and corresponding distances. To enable
generalization at least one template has to be present, for example
constant and equal to zero (zero template), together with a
smoothness related $O$. The problem of constructing the prior is,
for example, discussed in Section 5.2. At this step we assume the
mixture model for the prior to be fixed, except possibly for a few
remaining parameters, like $\beta$, which can be adapted by
cross-validation or, after defining a corresponding prior, by
explicit Bayesian integration. Also, our interest is not restricted
to identifying the maximally activated mixture component, as for
example in deterministic clustering, which can be seen as a low
temperature approximation. Our problem is finding the most probable
$f^{0,*}$, which does not necessarily have to coincide with the
center of one of the single mixture components, given only (a
small, finite) part of its vector components (not mixture
components) $y_i(x_i)$.

Especially interesting is the limiting case of very large (or
infinite) dimension $|X|$ of the space $X$ with only part of the
components $x$ given. To allow a useful degree of generalization in
this situation we must enforce strong correlations between different
dimensions $x$. This can be done, like in the example of smoothness,
by choosing special (e.g. metric, differentiable) structures on the
set $X$ of dimensions $x$. An example of a finite version of
smoothness for the regression function is
$\sum_i (y_i - y_{i+1})^2$, writing $i$ for the dimension index
$x$. This restricts the centers (means) to the neighborhood of the
diagonals of adjacent dimensions.

74 Which shall mean they are quadratic forms, which however may
include $\delta$-like forces (data), potentially higher order
derivatives or nonlocal terms (e.g. in time, if $x$ corresponds to
the time variable), as well as linear terms in coordinates and
derivatives (e.g. friction).

Similar to a mixture model for function approximation is the
clustering algorithm of Rose, Gurewitz, & Fox, 1990. In clustering
one tries to find for a given set of points $x_i$ ($y$ in our
notation) a corresponding cluster centroid $j$ ($f^0$ in our
notation) with respect to an error function $E(x, j)$ ($L(y, f^0)$
in our notation, or more precisely, an approximation problem with
$l(y, f)$ and identification of $F$ with $F^0$). Like in our case
the temperature parametrizes the convexity/concavity of the error
surface, which in the case of clustering regulates the number of
distinct centroids. The $\beta$ parameter is the Lagrange parameter
in a maximum entropy approach determining the average error.
However, we do not assume the average error to be fixed in advance
and, as often in statistical physics, the temperature or $\beta$
itself is the more natural parameter, even if uniquely related to
the average error. For an application in pairwise clustering see
Hofmann and Buhmann, 1997, and references therein. In contrast to
the mixture model we are considering here, the error or
log-probability of this problem is not a "free" Gaussian model and
the method is combined with an additional mean-field approximation.
For the use of a temperature parameter in optimization and matching
problems see Yuille, 1990, Yuille, Kosowsky, 1994, Yuille, Stolorz,
Utans, 1994, and for simulated annealing, for example, Aarts &
Korst, 1989.

The sum over i in the mixture model is analogous to 
the integration over /° in the full risk. Indeed, replacing 
to obtain a more symmetric notation f df° by f df® and 

^2 i by / df i? , we can write for r(/, /) for an approxima- 
tion problem with F = F® 

r(f, /) = - fdf° fdf° fdyp(f°,f°)p D (y\f°,f°) lnp D (y\f) 



df 1 Jdyp{f 1 ) jdf° p(f° \fi)p D (y\fi,f%) lnp D (y\f) 
df? [dyp(f 1 )p D (y\f° 1 )lnp D (y\f). 



If the $f^0_2$-integral (or summation over $i$) is performed exactly
(or a saddle point approximation with multiple saddle points is
used) then

$$p(f^0_1) = \int df^0_2\; p(f^0_1, f^0_2)$$

is a mixture of components of the form of $p(f^0_1, f^0_2)$, i.e.
for log-probabilities

$$L(f^0_1) = \ln \int df^0_2\; e^{L(f^0_1, f^0_2)}.$$

An example of such an integration is given by the 
mixture-like elastic net energy (Durbin, Willshaw, 1987) 



Eri 



M N 
with given vectors $x_j$ and $N > M$ vectors $y_i$ to be
optimized with $y_{N+1} = y_1$. This energy $E_{mix}$ can
be obtained by summing over binary variables $V_{kj}$ as

-{i/p)\iiY,v e ~ fjE with 

E (v, y) = ^2 v kj + t J2 \y* ~ ^+! I 2 ' 

under the restriction X^? ^i = 1, V& (Yuille, 1990, 
Yuille, Stolorz, Utans, 1994). 

The form (30) uses one level of Gaussian mixtures. The one level
structure is in principle no restriction, as every logical formula
can be written in either conjunctive or disjunctive normal form.
Those may however be very lengthy and contain negations, which one
wants to avoid in continuous cases. Thus, a hierarchical model may
be much more economical. Form (30) also uses, with Gaussian
processes, the simplest possibility for the single mixture
components. These can be seen as the first term of a Taylor
expansion of more general mixture components. Correlation templates
are higher order terms

< y®y-T (2) \0 (2) \y®y-T (2) >, 
Here, y y are matrices with operator 

A (2) = (9 (2) = Y^O k ®Oi. 

acting on the enlarged vector space of those matrices. 
Analogously one may consider higher order terms 

n n 

A(") =< (g)y -T^\0^\(g)y - T^ > . 

i i 

The natural choice for templates is 

n 

For T< n ) = ®" Ti and 0( n ) = ®" d the A^ n ) factorize 

n 

n<y-T i \O i \y-T i > . 

i 

For linear $O_i$ this non-quadratic interaction term (i.e.
non-Gaussian probability) has minima at every $y = T_i$.
Thus, for different $T_i$ already one mixing component,
or the corresponding energy, creates multimodal,
i.e. nonconvex or OR-like, functions. Those multimodal
(interaction) terms can arise by integrating out (hidden
or latent) variables, e.g. microscopic degrees of freedom.
This integration or summation (over probabilities
$p_i = e^{-\beta E_i}/Z$ and not over energies $E_i$) is a realization of
OR and results in a mixture model (with possibly an infinite
number of components). Such an 'effective' energy
$E$ represents, from the point of view of the finer system,
a free energy $F$.$^{75}$ Most often, but not always as we shall
see, effective energies are used in the range where they
are approximately $\beta$-independent.

$^{75}$ Instead of describing effective energies as the result of
marginalizing, one can see the introduction of additional degrees
of freedom as an improvement of an existing theory,
which 'explains' certain features of an older, i.e. then effective,
energy function. Such additional degrees of freedom
are introduced, for example, when finding new particles in
physics, or better explanations for diseases in medicine.



9.5 Interactions and Landau-Ginzburg regularization

Interaction terms introduced at the end of the previous
section are another possibility to write nonconvex OR-like
log-probabilities, and they finally provide the connection
with our discussion of a fuzzy implementation of prior
knowledge in Section 5. There we pointed out that one
may use fuzzy properties not only for probabilities, but
possibly also directly for log-probabilities. For such an
interaction regularization we have to specify an interaction
model, just as we have to specify the energy function
(Hamiltonian) for a specific system in physics. Indeed,
this specification of interactions for a physical model also
provides an example showing that it is sometimes more natural
to specify log-likelihoods (energies) directly.

Usually, those are taken as polynomial functions

$$L = g\Big(\sum_i \prod_{j_i} \langle y - T^{ij_i} |\, O^{ij_i} \,| y - T^{ij_i}\rangle\Big). \qquad (32)$$
In a mixture regularization with model (30) for the free
energy the sum over $i$ is an implementation of OR for
log-probabilities. For the example (32), the product
over $j_i$ can be interpreted as some fuzzy OR according
to Eqs. (7) applied to properties defining the log-probability.
If the model consists of one term one could
speak of a 'zero temperature regularization with interactions'.
On the other hand the interactions might be
effective, i.e. chosen to approximate a more fundamental
mixture model. The interactions then have parameters
which depend on the $\beta$, i.e. also on the temperature, of the
underlying model. In this case it is better to speak
of an 'effective interaction regularization'. Effective, i.e.
$\beta$-dependent, interactions may result from a Taylor
expansion of a 'true' underlying free energy or
log-probability. In contrast to a high energy expansion,
where $L$ is expanded around $1/\beta = \infty$, one may also
expand around a finite $1/\beta = 1/\beta^*$. We will call
the case where we introduce a temperature-like parameter
$1/\beta$ (or reduced temperature $t = (1/\beta) - (1/\beta^*)$) in
the energy, in analogy to the celebrated phenomenological
treatment of phase transitions in physics, a Landau-Ginzburg
regularization. (See Landau & Lifshitz, 1980 (§145),
and for example Goldenfeld, 1992; Safran, 1994;
Ivanchenko & Lisyansky, 1995.) The Landau-Ginzburg
theory is used to model systems near a phase transition,
and many results at the critical temperature are independent
of many details of the system (universality
classes with respect to critical exponents). These results,
however, are not necessarily valid at the phase transition
when fluctuations are important. This is, for example,
the case in low dimensional systems with local, i.e. short
range, forces. We are not only interested in an effective $L$
in the immediate neighborhood of phase transitions, but
it is this neighborhood where most problems can arise
and where a nonlinear regularization is most different
from a linear approach.

Clearly, a mixture model (30) and an interaction model
(32) can be quantitatively quite different; they may however
share common qualitative features. Probabilities
according to a mixture model can easily be generated
in a two step process (first choose $i$ according to the
mixture coefficients, then draw from component $i$) if the
probability processes for the components $i$ are available. In an
interaction model a decomposition into simple Gaussian
processes is not necessarily possible. The form (32) has
however always a polynomial structure, which can be
helpful for calculational purposes. For fourth order polynomials
the solutions can always be given explicitly, so
the two template case is analytically solvable.$^{76}$ Notice
that the usual mean square data terms are encompassed
in both formulations, as operators $O^{ij_i}$ diagonal in
$x$-representation and with only a finite number of nonzero
elements.
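
The two step generation process for a mixture model can be sketched as follows. This is a minimal illustration, assuming finite-dimensional Gaussian components with explicitly given means and covariances; the function name `sample_mixture` and its arguments are hypothetical.

```python
import numpy as np

def sample_mixture(weights, means, covs, n, rng):
    """Draw n samples from a Gaussian mixture in two steps."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalized mixture coefficients
    # step 1: choose the component index i according to the mixture coefficients
    comp = rng.choice(len(weights), size=n, p=weights)
    # step 2: draw from the chosen Gaussian component
    samples = np.stack([rng.multivariate_normal(means[i], covs[i]) for i in comp])
    return comp, samples
```

No comparably simple recipe is available for a general interaction model, which is one practical difference between the two formulations.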

9.6 Mean field equations 

Now we get to the problem of solving these nonlinear
stationarity or mean field equations. We already discussed
that a maximum does not change under strictly monotonically
increasing functions $h(L)$, i.e. functions with
$L(f) > L(f') \Rightarrow h(L(f)) > h(L(f'))$ and $dh/dL > 0$.
Strictly monotonically decreasing functions $h(L)$ only
change a maximum into a minimum. Including such
functions $h$ the stationarity conditions, obtained by setting
the functional derivative to zero, read

$$0 = \frac{dh(L)}{dL}\,\frac{dL(f^0)}{df^0} = \frac{dh(L)}{dL}\,\frac{dL(y)}{dy},$$

where in our case $y$ represents the parameter vector $f^0$,
and we will see that iteration procedures can be related
to different $h$.

We will now give the mean field equations, for finite
temperature and (effective) polynomial interaction
regularization, in the form

$$O y = t \qquad (33)$$

with in general $y$-dependent $O = O(y)$ and $t = t(y)$, so
the equation is nonlinear.

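9.6.1 Mean field equations for finite temperature regularization

For a log-probability of mixture form (30) we find

$$O^M y = t^M, \qquad O^M = \frac{O^Z}{Z} = \sum_i \frac{Z_i}{Z}\, O^i, \qquad t^M = \frac{t^Z}{Z} = \sum_i \frac{Z_i}{Z}\, t^i, \qquad (34)$$

with, according to (31), $y$-dependent

$$Z_i = Z_i(y) = e^{-\frac{\beta}{2}\sum_{j_i}\langle y - T^{ij_i}|O^{ij_i}|y - T^{ij_i}\rangle + c_i}, \qquad Z = Z(y) = \sum_i Z_i(y),$$

and $y$-independent

$$O^i = \sum_{j_i} O^{ij_i}, \qquad t^i = \sum_{j_i} O^{ij_i}\, T^{ij_i}.$$

$^{76}$ Already 'solvable' nonlinear polynomial equations require
iteration methods: most roots have to be calculated
iteratively. There is no doubt, however, that this can be
done quite efficiently.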

Thus, denoting by $\langle\cdot\rangle_{(Z_i/Z)(y)}$ the expectation under
the probability $Z_i/Z$ the stationarity equation can be
written

$$0 = \sum_i \frac{Z_i}{Z}\,(O^i y - t^i) = \langle O^i y - t^i\rangle_{(Z_i/Z)(y)}. \qquad (35)$$

Using the monotonic transformation $h(L) = e^L = p$, or,
equivalently, multiplying the mean field equation by $Z$,
gives

$$O^Z y = t^Z.$$

For only one template $T^{1,1} = T$, i.e. $t^M = OT$, a trivial
solution is always $y = T$. The stationarity equation can
be solved in the space where the inverse of $O^Z$ exists,
using for example the pseudo inverse. Alternatively, $O^Z$
may be extended to be invertible, for example by adding
the identity on the zero space, a mass term proportional
to the identity operator, or by imposing boundary conditions.

In the high temperature limit $\beta \to 0$ and $c_i = c, \forall i$,
all $Z_i$ become equal so that

$$\sum_i O^i\, y = \sum_i t^i \;\Rightarrow\; y = \Big(\sum_i O^i\Big)^{-1} \sum_i t^i = \bar T.$$

Hence, we find the template average $\bar T$ as high temperature
limit of the mean field solution for the mixture
model.

In the low temperature limit $\beta \to \infty$ only the largest
$Z_i$ survive(s), so that all $T^{0i}$ with ("positive error gap")



< 



/ \ ^ ^ rpO l rpi k i I /T\i k i I rpO l rpi k i ^ 



2c£ 



2c,w 



for all $i' \neq i$ in the limit $\beta \to \infty$, become a low
temperature solution.

Using the same (invertible) operator for all mixture
components so that $O^{ij_i} = O/N_i$ results in the equation

$$y = \bar T^M = \sum_i \frac{Z_i}{Z}\, T^{0i}. \qquad (36)$$

The equation is still nonlinear, because of $Z_i = Z_i(y)$,
and the $y$ are still nonlocally coupled. In this situation
the space of possible solutions is the convex hull
spanned by the low temperature limits $T^{0i}$, which are
$y$-independent. The high temperature limit becomes
$\bar T = \frac{1}{N}\sum_i^N T^{0i}$. We will call the $O$-distance

$$d_O(y,y') = \sqrt{\langle y - y'|O|y - y'\rangle} = \|y - y'\|_O, \qquad (37)$$

the canonical distance of a finite temperature regularization
problem with $O = O^i$, and

$$d_{O,T}(y,y') = \frac{d_O(y,y')}{\max_{ij} d_O(T^{0i},T^{0j})} \qquad (38)$$

the normalized canonical distance. Solutions $y^*$ of
Eq. (36) must have $0 \le d_{O,T}(y^*, T^{0i}) \le 1$ for all $i$.
Notice that the canonical distance of a regularization
problem depends through $N_D$, and the normalized canonical
distance through $N_D$ and $D$, on the actual given data.
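
The fixed point form of Eq. (36), in which $y$ is a $Z_i/Z$-weighted average of the low temperature limits, can be explored numerically. The sketch below is an illustration under simplifying assumptions: finite-dimensional vectors, one common positive definite matrix `O` for all components, and a naive fixed point iteration whose convergence is not guaranteed in general. All names are hypothetical.

```python
import numpy as np

def weights(y, T, O, beta, c):
    """Z_i(y)/Z(y) for Z_i = exp(-beta/2 <y - T_i|O|y - T_i> + c_i)."""
    diff = y[None, :] - T                                  # shape (n, d)
    expo = -0.5 * beta * np.einsum("id,de,ie->i", diff, O, diff) + c
    expo -= expo.max()                                     # numerical stabilization
    z = np.exp(expo)
    return z / z.sum()

def iterate(y0, T, O, beta, c, steps=200):
    """Naive fixed point iteration y <- sum_i (Z_i/Z)(y) T_i."""
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        y = weights(y, T, O, beta, c) @ T                  # convex combination of templates
    return y
```

For very small `beta` the iteration returns (nearly) the template average, while for large `beta` and a starting point close to one template it stays at that template. Every iterate is a convex combination of the rows of `T`, in line with the convex hull statement above.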

For equal $O^i$ the exponents can be diagonalized
simultaneously, so that

$$\langle y - T^{0i}|O|y - T^{0i}\rangle = \langle y - T^{0i}|U V U^{T}|y - T^{0i}\rangle$$

with diagonal $V$. Hence, in the corresponding eigenvector
representation the operators in all exponents are
local simultaneously. We saw already that $O$ is data
dependent, and so are therefore also its eigenvectors.

9.6.2 Mean field equations for regularization with polynomial interaction

For the example (32) one finds with $h(L) = g^{-1}(L)$

$$O^I y = \sum_i\sum_{j_i} M_{ij_i}\, O^{ij_i}\, y = \sum_i\sum_{j_i} M_{ij_i}\, O^{ij_i}\, T^{ij_i} = t^I \qquad (39)$$

with

$$M_{ij_i} = M_{ij_i}(y) = \prod_{k_i \neq j_i} \langle y - T^{ik_i}|O^{ik_i}|y - T^{ik_i}\rangle.$$

In the case of only one template $T^{1,1} = T$, i.e. $t^1 = OT$,
this has also as trivial solution $y = T$. For $O^{ij_i} = O/n_i$
the equation reduces to



y 



T,iT, u MijlTk 

M 



with 



M = M(y) = J2J2 Mi ' 



Hence, the solutions are restricted to the space spanned 
by convex combinations of the T ZJi . 

In most cases nonlinear equations can only be solved 
numerically with the help of iteration procedures. 

9.7 Iteration procedures = learning algorithms 

Here we will discuss iteration procedures to solve the
inhomogeneous integro-differential equations.$^{77}$ Iteration
procedures correspond to the actual learning algorithms.
We consider here their application to MaP equations,
but the methods also apply to MiR equations or
a full Bayesian approach.

An iteration procedure is defined by a function $G^i$

$$f^{0,i+1} = G^i(f^{0,i})$$

producing new guesses $f^{0,i+1}$ from a current guess $f^{0,i}$,
and with fixed points being solutions of the stationarity
conditions. We restrict ourselves in the following to solving
the stationarity conditions, and assume the maximum
(or minimum, saddle point) conditions, i.e. the second
derivatives, to be checked separately. We can construct
an iteration procedure by choosing a function $\mathcal{H}^i$, which
we also allow in general to be (also stochastically)
$i$-dependent, and write for $y^{i+1} = G^i(y^i)$

$$f^{0,i+1} = G^i(f^{0,i}) = f^{0,i} + \mathcal{H}^i\Big(\frac{dL(f^0)}{df^0}\Big|_{f^{0,i}}\Big).$$

To ensure that fixed points are solutions of the stationarity
equations we require $\mathcal{H}^i(0) = 0$, and in order not to create
additional spurious solutions one must have $\mathcal{H}^i(x) \neq 0$
for $x \neq 0$. We can fulfill those conditions by defining
a (possibly $x$-dependent) nonsingular linear mapping
(a matrix for vector $x$) $H^i(x)$ acting on $x$ with
$\det(H^i(x)) \neq 0, \forall x \neq 0$, by $\mathcal{H}^i(x) = H^i(x)\,x$. The
matrix $H^i(x)$ is usually chosen positive definite (and
symmetric) for a maximization problem. $H^i(x)$ may result
from a (bijective) transformation of the independent
variables $f^0 = T^i(f'^0)$ and/or from a strictly monotonic
transformation $h^i(L)$ of the log-posterior. For everywhere
differentiable transformations the chain rule gives

$$\frac{dh^i(L(T^i(f'^0)))}{df'^0} = \frac{dh^i(L)}{dL}\,\frac{dL(f^0)}{df^0}\,\frac{dT^i(f'^0)}{df'^0}.$$

Strictly monotonic and therefore invertible transformations
$h^i$, $T^i$ do not lead to additional stationary points,
i.e. spurious solutions. We write

$$f^{0,i+1} = f^{0,i} + H^i(f^{0,i})\,\frac{dL(f^0)}{df^0}\Big|_{f^{0,i}}. \qquad (40)$$

$^{77}$ For an introduction to numerical methods see for example
Hackbusch, 1989; Press, Teukolsky, Vetterling, & Flannery,
1992; and references therein.



For example, gradient, Newton, or Quasi-Newton algorithms
correspond to special $H^i$. The $i$-dependence also
allows methods like conjugate gradient to be included. In
general an iteration procedure can be given by an implicit
equation $G^i(f^{0,i}, f^{0,i+1}) = 0$. A solution $f^{0,i+1}$,
however, has to be given in an explicit form. We can
formally extend the parameter vector $f^0$ to a population
vector (of parameter vectors) by introducing dummy
variables $m$ (population indices), i.e. split $f^0$ into several $f^0_m$.
Then $H^i$ can induce interactions between different $f^0_m$.
Iteration procedures of order $m$ can for example be obtained by

$$f^{0,i+1} = f^{0,i} + \sum_{j=i-m+1}^{i} H^{i,j}\big(\{f^{0,k}\}_{i-m+1\le k\le i}\big)\,\frac{dL(f^0)}{df^0}\Big|_{f^{0,j}}.$$

The matrix $H^i$ can be a stochastic function (e.g. stochastic
annealing) which is deterministic only in some (zero
temperature) limit, or individual $H^i$ can be chosen nonsingular
only in subspaces, as in line search algorithms, as long
as a higher iterated equation can be written with a
nonsingular $H$ and the convergence check is made for this
iterated equation. Similarly, for transformation methods
(homotopy, EM-like algorithms) the log-posterior $L^i$
corresponds to a strictly monotone transformation of the
original problem only at a fixed point of the iteration
procedure. For example, $L^i$ can represent a smoothed
version of $L$ (e.g. deterministic annealing).

If $G^i$ contains integrals over $y_x$, like in the nonlinear
terms, already much 'nonlocal information' may be contained
in one iteration with $G^i$. Such integrals are for
example introduced by the EM algorithm which defines
hidden variables $u$, i.e.



A-UOy-t) 



or in components 



*/•> = ««'"> = /*,*/•,„> = /*.««>•■■> GM = j dt , A ^ ^_j d ^,^ 



with also nonnegative $p(f^0,u) \ge 0$. Then we define the
corresponding conditioned variables which have by construction
equal norm if summed over $u$, independent of $f^0$,

$$p(u|f^0) = e^{L(u|f^0)} = \frac{p(f^0,u)}{p(f^0)}.$$

This allows to add to $L$ a cross-entropy term in the
conditioned variables

$$L(f^0) + \int\! du\; e^{L(u|f^{0,ref})}\, L(u|f^0) \qquad (41)$$

$$= \int\! du\; e^{L(u|f^{0,ref})}\, L(f^0,u) = Q(f^0,f^{0,ref}).$$

This transformation is strictly maximum sufficient relative
to $L(f^{0,ref})$ (see Section 6.5) by construction, i.e. because
conditioned variables have $f^0$-independent norm
over $u$, the additional term is positive and maximal if
$L(u|f^0) = L(u|f^{0,ref}), \forall u$ (see Section 5.3.4), which is
the case for $f^0 = f^{0,ref}$. The reference point $f^{0,ref}$ has
to be adapted during iteration. This has to be done
at the latest when a local maximum for fixed $f^{0,ref}$ is reached.
However, maximizing $Q$ does not necessarily require
finding a maximum for every fixed $f^{0,ref}$; it is enough to
increase $Q$ with every iteration (Generalized EM). Summarizing,
EM-like algorithms use a transformation $h^i(L)$
which is strictly monotonic only at a fixed point and during
iteration only strictly maximality (or, respectively,
minimality) sufficient relative to some previous guess
(see Section 6.5).

While nonsingular $H^i$ ensure that the fixed points of
the iteration procedure are zeros of the gradient, this
does not guarantee convergence. In general iteration
procedures can produce all varieties of features known
from discrete dynamical systems, including limit cycles
or chaotic behavior (see for example Devaney, 1986;
Beck & Schlögl, 1993; and references therein).

We now discuss the example Eq. (33) in more detail.
Firstly, we write Eq. (33) in a form

$$y = G(y),$$

by choosing some additive decomposition $O(y) = A(y) + B(y)$,
or a decomposition of some $H^i(O)$, with $A$ positive
definite and therefore invertible. This can also be
done by directly selecting a convenient $A$, for example
$y$-independent, which then defines a corresponding
$B = O - A$.

Here $A$ is usually chosen to be a linear operator, but
invertibility and not linearity is the crucial property.
(However, inverting a nonlinear operator has usually
to be done again by iteration, requiring another linear
operator to be inverted.) Then we obtain an iteration
procedure by defining the left hand side to be $y^{i+1}$ and
$y$ on the right hand side to be $y^i$. For our examples this
gives a $G(y)$ of the form

$$G(y) = A^{-1}(t - By).$$



The equations, or their variants described below, are
solved by choosing a representation (i.e. a linear basis)
to write them in component form, for example, in
$x$-representation:

$$y_x^{i+1} = G_x(y^i).$$

In other cases one may prefer to work in another basis,
for example plane waves (or general coherent states,
Blaizot & Ripka, 1986) to transform a differential equation
into an algebraic equation.

To achieve convergence one usually has to include a
step size $\eta$, a method which is also called relaxation.
Then a new guess $y^{i+1}$ is generated from a previous guess
$y^i$ by mixing only part of the new solution into the old one,
giving

$$y^{i+1} = G_\eta(y^i) = (1-\eta)\,y^i + \eta\,G(y^i) = y^i + \eta\,\big(G(y^i) - y^i\big) \qquad (42)$$

$$= y^i + \eta\,A^{-1}(y^i)\,\frac{dL(y)}{dy}\Big|_{y^i},$$

where $\eta$ can be included in the definition of the operator
$A^{-1} \to \eta A^{-1}$. The expression $t - Oy$ is the gradient of
the log-probability $L$ (or negative gradient of the energy
$E$) or the (negative or positive, respectively) residual at
point $y$. For linear $A$ we recognize an iteration procedure
of the form (40) with

$$H^i = \eta A^{-1}.$$
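
The relaxation iteration (42) can be tried out on a small linear problem $Oy = t$ with $y$-independent $O$ and $t$. The following sketch assumes finite-dimensional matrices and NumPy; the function names are hypothetical. For a linear problem the Hessian of $L$ is $-O$, so the iteration matrix is $I - \eta A^{-1} O$, and its spectral radius decides whether the iteration converges.

```python
import numpy as np

def relax(O, t, A_inv, eta, y0, steps):
    """Iterate y <- y + eta * A^{-1} (t - O y), cf. Eq. (42)."""
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        y = y + eta * A_inv @ (t - O @ y)      # t - O y is the gradient of L
    return y

def spectral_radius(O, A_inv, eta):
    """Spectral radius of the iteration matrix I - eta * A^{-1} O."""
    M = np.eye(len(O)) - eta * A_inv @ O
    return np.abs(np.linalg.eigvals(M)).max()
```

Choosing `A_inv` as the inverse of the diagonal part of `O` gives a Jacobi-type iteration, while `A_inv = inv(O)` with `eta = 1` solves the linear problem in a single step, as expected for Newton's method. With `A_inv` equal to the identity the same step size may already lead to divergence, which is the situation where a smaller `eta` is needed.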

When a linear approximation of $dL/dy$ is possible in the
neighborhood of some $y^0$ the convergence depends on
the spectral radius of $I + A^{-1}\mathcal{H}(y^0)$, where $\mathcal{H}$ denotes
the Hessian and $I$ the identity. In the linear approximation
the Newton algorithm is optimal. For a linear
equation and fixed $A$, choosing $\eta < 1$ is also called
underrelaxation, and $\eta > 1$, which can improve convergence, is
called overrelaxation (see for example Press, Teukolsky,
Vetterling, & Flannery, 1992). For example, in the finite
temperature model (30) the Hessian for the functional
$p \propto \sum_i Z_i$ reads



$$\mathcal{H}^M(p) = -\beta \sum_i Z_i \big(O^i - \beta\,|O^i y - t^i\rangle\langle O^i y - t^i|\big),$$

while one finds for $L = \ln p$

$$\mathcal{H}^M(L) = \sum_i \frac{Z_i}{Z}\big(-\beta\,O^i + \beta^2\,|O^i y - t^i\rangle\langle O^i y - t^i|\big) - \beta^2\, \Big|\sum_i \frac{Z_i}{Z}(O^i y - t^i)\Big\rangle\Big\langle \sum_j \frac{Z_j}{Z}(O^j y - t^j)\Big|.$$

$^{78}$ Multiplying with $(1/\eta)A$ and projecting onto an
infinitesimal $\langle dy|$ the iteration procedure
$y^{i+1} = y^i - \eta A^{-1}(y^i)\,(O(y^i)\,y^i - t(y^i))$ can be written
$(1/\eta)\,\langle dy|A|\Delta y^i\rangle = \langle dy|(\delta L/\delta y)|_{y=y^i}\rangle = dL$.
For infinitesimal $|\Delta y^i\rangle = |y^{i+1} - y^i\rangle$ approximately
equal to $dy$ this shows that for positive (semi)definite $A$
the differential $dL$ is larger than (or equal to) zero. Thus,
the functional $L$ increases during iteration for
$\eta$ small enough.

Notice that in the high temperature limit the terms
proportional to $\beta^2$ vanish faster than $-\beta\sum_i \frac{Z_i}{Z} O^i$. For only
one mixture component the last two terms compensate
for $L = \ln Z$, and the second term vanishes at the stationary
point $y = O^{-1}t$ for $p = Z$. For large deviations
$O^i y - t^i$ the Hessian does not need to be negative definite
even for positive definite $O$. Similarly, the second derivative
$\partial(\partial g^{-1}(L)/\partial y(x))/\partial y(x') = \partial(t^I - O^I y)/\partial y(x')$
gives for the Landau-Ginzburg model (39)

$$\mathcal{H}^I = -\sum_i\sum_{j_i} M_{ij_i}\, O^{ij_i} - 2\sum_i\sum_{j_i}\sum_{k_i \neq j_i} M_{ij_ik_i}\, O^{ij_i}|y - T^{ij_i}\rangle\langle y - T^{ik_i}|O^{ik_i}$$

with

$$M_{ij_ik_i} = \prod_{l_i \neq j_i, k_i} \langle y - T^{il_i}|O^{il_i}|y - T^{il_i}\rangle.$$

$G^i_\eta(y^i) = G_{\eta^i, A_{(i)}}(y^i)$ can be chosen $i$-dependent by
varying $\eta = \eta^i$ or $A = A_{(i)}$, which can include dependence
on past values $y^{0,k}$, $k < i$. The operator $A$ should
be chosen adapted to the problem, i.e. approximating
$O$ or, at least near a stationary point even better, the
Hessian $\mathcal{H}$. A not exactly positive definite $A$ might be
helpful at the beginning of an iteration, if it is easy to
invert and leads 'mainly' in the right direction. Then
a proper $A$ (e.g. equal to the identity, see below) can be
chosen in subsequent iterations, when the solution $y$ is
already approximately correct. Convergence is not necessarily
guaranteed, but depends, besides on choosing a
good initial guess, on adjusting the relaxation factor $\eta$.
Choosing $|\eta|$ small enough, the change $|y^{i+1} - y^i|$ becomes
arbitrarily small, increasing the 'resolution' of the search
and also reducing oscillations. This usually allows methods
which search in directions with always positive (or
negative) projections on the gradient to reach at least
a local maximum (minimum) if the step size is small
enough. (For convergence results see for example Bertsekas,
1995; Golden, 1996; and references therein.) The
step size can also be determined by a line search in the
direction given by $G^i(y^i) - y^i$.

The usual gradient algorithm or method of steepest
descent is a special iteration procedure of this type: If
$dL/dy = -(Oy - t)$ and if $A$ is the identity operator,
then the term $G(y) - y$ is the gradient of $L$. Iteration
schemes can also be related to the gradient of other surfaces
but with the same stationary points. For example,
for Eq. (33) the iteration (42) is for $A = 1$ a gradient
algorithm on the surface $p$, and for a general linear
operator $A$ a gradient on a surface parameterized by
variables $T(f^0) = z = \sqrt{A}\,y$. For positive definite $A$ the
square root exists and we have:

$$\frac{dL(z)}{dz} = \sqrt{A}^{\,-1}\,\frac{dL(y)}{dy}.$$

Here the square root is equal to its transpose as we understand
positive definite to include symmetric. Thus,
by multiplying with $\sqrt{A}$ one finds

$$y^{i+1} = y^i + \eta A^{-1}\frac{dL(y^i)}{dy} \;\Leftrightarrow\; z^{i+1} = z^i + \eta\,\frac{dL(z^i)}{dz}.$$

To apply EM-like algorithms we can choose for a
log-posterior of the form (30) with only positive terms ($c_i$
real) the summation index $i$ (which has nothing to do
with the iteration index $i$) as hidden variable

$$p(f) = \int\! du\; p(f,u) = \sum_i p(f,u_i),$$



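with

$$p(f^0,u_i) = e^{L(f^0,u_i)} = e^{-\frac{\beta}{2}\sum_{j_i}\langle y - T^{ij_i}|O^{ij_i}|y - T^{ij_i}\rangle + c_i} = Z_i$$

so that

$$p(u_i|f^0) = \frac{e^{-\frac{\beta}{2}\left(\sum_{j_i}\langle y - T^{ij_i}|O^{ij_i}|y - T^{ij_i}\rangle\right) + c_i}}{\sum_i e^{-\frac{\beta}{2}\left(\sum_{j_i}\langle y - T^{ij_i}|O^{ij_i}|y - T^{ij_i}\rangle\right) + c_i}} = \frac{Z_i}{Z}$$

and choosing a reference $f^{0,ref}$

$$Q(f^0,f^{0,ref}) = \sum_i \frac{Z_i^{ref}}{Z^{ref}}\, \ln Z_i,$$

where

$$Z_i^{ref} = e^{-\frac{\beta}{2}\left(\sum_{j_i}\langle y^{ref} - T^{ij_i}|O^{ij_i}|y^{ref} - T^{ij_i}\rangle\right) + c_i}, \qquad Z^{ref} = \sum_i Z_i^{ref},$$

$$\ln Z_i = -\frac{\beta}{2}\sum_{j_i}\langle y - T^{ij_i}|O^{ij_i}|y - T^{ij_i}\rangle + c_i.$$

So the stationarity condition for fixed $f^{0,ref}$ reads

$$O^{ref} y = t^{ref},$$

with

$$O^{ref} = \sum_i \frac{Z_i^{ref}}{Z^{ref}}\, O^i, \qquad t^{ref} = \sum_i \frac{Z_i^{ref}}{Z^{ref}}\, t^i,$$

or, after multiplying by $Z^{ref}$,

$$O^{Z,ref}\, y = t^{Z,ref}.$$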



With $O^{ij_i}$ being linear operators, and therefore quadratic
$\ln Z_i$, this is a linear equation, which can be solved in
one step by inverting $O^{ref}$. In case a direct inversion
is not feasible, this inversion can also be approximated
by iterative procedures. In general, however, the $\ln Z_i$
can be non-quadratic functions. The model $F$ can for
example allow varying mixture coefficients (included in
the $c_i$) or different variances (included in the factors of
$O^{ij_i}$) which then have to be included in the optimization
process. Such additional parameters are part of
the description of $f^0$. If we implement them by a 'hard
OR' with uniform prior on the allowed space this does
not give rise to additional terms and means practically
minimizing the resulting equations also with respect to the
additional parameters. In general we can also add prior
terms for the additional parameters. One must be careful
however about the range of parameters consistent
with prior knowledge. Allowing for example to optimize
the relative weight of data and smoothness terms on the
training set can end up in the so-called '$\delta$-catastrophe',
i.e. a solution having peaks at every data point and, in
case of a zero template, being zero elsewhere, a situation
most times not intended to be a very likely member of
$F^0$. Thus, the EM algorithm for Gaussian mixtures can
be seen as a method solving a reference equation linear
in $y$. The stationarity equations for fixed reference $f^{0,ref}$
can become at least partly nonlinear if $O^{ij_i}$, $c_i$ or the $y$
itself are parameterized nonlinearly. One may also use
cross-validation to determine those parameters.
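
The statement that each EM step amounts to solving a reference equation linear in $y$ can be illustrated for a small finite-dimensional mixture of quadratic terms. The sketch below is an assumption-laden illustration, not the original implementation: each component is given as a list of hypothetical `(O, T)` pairs (operator matrix and template vector), the reference weights $Z_i^{ref}/Z^{ref}$ are frozen at the current guess, and the weighted linear system is solved with NumPy.

```python
import numpy as np

def log_Z(y, comps, beta, c):
    """ln Z_i(y) = -beta/2 * sum_j <y - T_ij|O_ij|y - T_ij> + c_i for every i."""
    out = []
    for terms, ci in zip(comps, c):
        q = sum((y - T) @ O @ (y - T) for O, T in terms)
        out.append(-0.5 * beta * q + ci)
    return np.array(out)

def log_posterior(y, comps, beta, c):
    """L(y) = ln sum_i Z_i(y), up to a y-independent normalization constant."""
    lz = log_Z(y, comps, beta, c)
    m = lz.max()
    return m + np.log(np.exp(lz - m).sum())

def em_step(y_ref, comps, beta, c):
    """Freeze the weights at y_ref and solve the linear reference equation."""
    lz = log_Z(y_ref, comps, beta, c)
    w = np.exp(lz - lz.max())
    w /= w.sum()                                   # Z_i^ref / Z^ref
    O_ref = sum(wi * sum(O for O, _ in terms) for wi, terms in zip(w, comps))
    t_ref = sum(wi * sum(O @ T for O, T in terms) for wi, terms in zip(w, comps))
    return np.linalg.solve(O_ref, t_ref)           # assumes O_ref is invertible
```

Repeating `em_step` does not decrease `log_posterior`, and at a fixed point the nonlinear stationarity equation is fulfilled. The result still depends on the starting point when several maxima exist.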

The EM transformation (41) does not yet define the
maximization procedure used to maximize $Q(f^0,f^{0,ref})$,
i.e. an $H^i(O^{ref})$ can be chosen and split into $A$ and $B$ in
various ways. Every iteration procedure has to separate
the occurrences of $y$ in the stationarity conditions into
old $y^i$ and new $y^{i+1}$. An EM algorithm treats occurrences
of $y$ at two time scales: some are renewed during
maximizing $Q$, others when changing $y^{ref}$. Thus, $Q$ may
be maximized by any method, including those based on
a random search, gradient-like, or EM-like algorithms.
See for example the Helmholtz machine (Dayan, Hinton,
Neal, & Zemel, 1995; Hinton, Dayan, Frey, & Neal, 1995)
for an application of the EM algorithm to hierarchically
defined $p(f^0)$.

Table 4 gives the, in general iteration dependent, matrices
$A^{-1}_{(i)}$ for some common iteration (or learning) procedures.
They are special cases of relaxation techniques
and most of them are local, i.e. they only depend on
one previous guess $y^i$ and derivatives at that location.
The gradient algorithm corresponds to choosing $A^{-1}$ equal to
the identity $I$. Jacobi iteration uses a diagonal $A$, e.g.
the diagonal part of $O$; the Gauss-Seidel method also
includes the lower triangular part, e.g. of $O$. Newton's
method takes the negative Hessian $\mathcal{H}$ for minimization
and maximization. More precisely, the Newton
method uses the given formula at locations where the
Hessian is negative definite (or positive definite for
minimization); at other locations the method has to resort
to another optimization algorithm. Quasi-Newton
methods try to approximate the Hessian $\mathcal{H}$. In the



table the abbreviations A 



(o) 



v 



y 1 



t- 1 and Ai 



Local learning algorithms 



Gradient 

Jacobi 

Gauss-Seidel 

Newton 

Quasi-Newton 

( DFP ) 
CG 

EM 



A~ 



.(0 



I 



A 1 ( z ) diagonal 
A~ l (i\ triangular 



A' 1 



d 2 L 

' (dy(x)dy(x')\y l 
A* A*, 



-%- 



A- 1 - A- 1 + (°> (°> 



(i) 



A 



^-i) A (D A (i) a 7j- 



(0, 



1- 



(»-i) dL\ 



*(d 



r] 1 - 1 dy \y l ~ 1 ((dL_\ T dL_\\ ._ 

\\ dy J dy J I y l 



A 



A 



,-ldh^UL) 



(«) - •" (»") dL 

with h re f(L) = 

L + Jdue L ^f°- r '^L(u\f ) 



Table 4: Some local learning algorithms 



dL I 



are used. The given formula refers to the 



dL I 

dy \y l dy \y l 

DFP (Davidon-Fletcher-Powell) method, in the BFGS 
(Broyden-Fletcher-Goldfarb-Shanno) method, for ex- 



, T 



ample, a term a l b l b l is added with a 1 = A z n >, H l l A 



(i) 



\i) 



\&b l 



a;, 



i£l_ 



H l ~ 1 Al 



ill 



(T denotes the transpose). 



(i)
(o) (i)

Conjugate gradient methods (CG) determine the step
size by a line search in conjugate directions, obtained by
a Gram-Schmidt procedure. Directions are called conjugate
if they are orthogonal in $O$-distance, assuming
we are solving $Oy = t$. For non-quadratic problems
they are usually combined with a heuristic to restart the
Gram-Schmidt procedure. (See for example Bertsekas,
1995.) For the EM algorithm $H^i$ defines the chosen
iteration algorithm used for $Q$ with fixed reference state.
EM algorithms are not restricted to local methods. (So
for them the word local in the table caption does not
necessarily apply.) Note that $dL/df^0$ in the nonlinear
case usually contains integrals over $x$ and therefore even
optimization methods which are local with respect to $f^0$
are nonlocal with respect to $x$.
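
The notion of directions being conjugate, i.e. orthogonal in $O$-distance, can be made concrete for the linear problem $Oy = t$. The following sketch uses the standard linear conjugate gradient recurrence (equivalent to a Gram-Schmidt orthogonalization of the residuals in the $O$ inner product) and assumes a symmetric positive definite matrix `O`; it is an illustration with hypothetical names, not the procedure of any specific reference.

```python
import numpy as np

def conjugate_gradient(O, t, y0, tol=1e-12):
    """Solve O y = t for symmetric positive definite O; also return the directions."""
    y = np.array(y0, dtype=float)
    r = t - O @ y                       # residual = gradient of L at y
    d = r.copy()                        # first search direction
    dirs = []
    for _ in range(len(t)):
        if np.sqrt(r @ r) < tol:
            break
        Od = O @ d
        alpha = (r @ r) / (d @ Od)      # exact line search along d
        y = y + alpha * d
        r_new = r - alpha * Od
        dirs.append(d)
        # next direction: new residual made O-orthogonal to the previous directions
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return y, dirs
```

The returned directions satisfy $\langle d_i|O|d_j\rangle \approx 0$ for $i \neq j$ up to rounding, and in exact arithmetic the iteration terminates after at most as many steps as there are components of $y$.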

Nonlinear equations normally have multiple solutions
corresponding to several extremal points. If several
solutions have to be considered in the last risk minimization
step of a MaP-MiR approximation, this requires
calculation of the relative weight factors of the solutions or
widths of the maxima depending on the second derivatives,
or a corresponding estimate or assumption (see for
example Gelman, Carlin, Stern, & Rubin, 1995, and especially
for neural networks: Buntine & Weigend, 1991;
MacKay, 1991, 1992b, 1992c; Neal, 1996).

Nonlinear inhomogeneous equations appear for example
in scattering theory as approximation to higher dimensional
linear inhomogeneous equations (see for example
Austern, 1970; Taylor, 1972; Newton, 1982).
There the inhomogeneities (data) are related to the in
and out channels representing the boundary conditions
or asymptotic states of the wave functions. Numerical
aspects, applications to scattering theory and related
higher order approximations are, for example, discussed
in Giraud & Nagarajan, 1991; Lemm, Giraud, & Weiguny,
1994; and Lemm, 1995ab.$^{79}$

Nonlocal templates, or technically the inhomogeneities,
can be used in the following way: Instead of
using a small space $F^0$ to represent possible states $f^0$ of
nature, corresponding to hard implemented priors,
one allows a larger space $F^0_S$ and implements $f^0 \in F^0$
within $F^0_S$ as priors with a soft OR by taking $f^0 \in F^0$
as templates for $F^0_S$. This allows going beyond $F^0$ if the
data require it. A soft implemented template for $p(f^0|f)$ is
not equivalent to using noisy answers for $f^0$: The state of
knowledge about $f^0$, that is $p(f^0|f)$, is updated through
data, while the noise levels of pure states are assumed to
be stationary, i.e. clamped during learning. Templates
can be seen as a method of transfer of knowledge between
tasks.

10 An introductory example 

10.1 The models 

To exemplify the techniques we study a case with
one-dimensional $x$ and two full templates, $T^1$, $T^2$, i.e. templates
which are defined for all $x$, in addition to standard data $D$. We
will study both the mixture and the interaction regularization.

10.2 Finite temperature regularization 

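To express ($D$ AND $T^1$) OR ($D$ AND $T^2$) we choose a
probability of the form

$$p(y) = e^{L} = \frac{Z}{Z_{F^0}} = \frac{Z_1 + Z_2}{Z_{F^0}}$$

with normalization constant

$$Z_{F^0} = \int_{F^0}\! df^0\, \big(Z_1(f^0) + Z_2(f^0)\big)$$

and Gaussian components

$$p \propto Z = e^{-\frac{\beta}{2}(\Lambda_D + \Lambda_1) + c_1} + e^{-\frac{\beta}{2}(\Lambda_D + \Lambda_2) + c_2}$$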



$^{79}$ For example in the Time Independent Mean Field Theory
(TIMF) for quantum mechanical scattering one obtains
approximate variational solutions for matrix elements
$\langle\chi'|O^{-1}|\chi\rangle$ (e.g. $O = E - H$, with energy $E$ and
Hamiltonian $H$, so $O^{-1}$ is the resolvent of $H$) by choosing
$\phi$, $\phi'$ from a space of possible trial functions for which
$\langle O^{-1}\chi' - \phi'|O|O^{-1}\chi - \phi\rangle$ is stationary. Notice the
similarity in the role of $O^{-1}\chi$ or $O^{-1}\chi'$ and that of a template
average $\bar T = O^{-1}\sum_i O^i T^i$. For a mean field approach
one chooses $\phi$, $\phi'$ as product of single particle functions.
Expanding the quadratic functional gives as variational solution

$$\langle\chi'|O^{-1}|\chi\rangle = \langle\chi'|\phi\rangle + \langle\phi'|\chi\rangle - \langle\phi'|O|\phi\rangle.$$

In contrast to the error minimization problems, in scattering
theory $E$ is in general a complex number and the wave functions
$\chi$, $\chi'$, $\phi$, $\phi'$ are allowed to be complex functions. Thus,
the stationary points are not maxima or minima but saddle
points, and the variational solutions do not yield bounds for
the exact solutions. At a saddle point, on the other hand,
the effect on the numerical value of the matrix element of
different directions of deviations of $\phi$, $\phi'$ from the true
solution can have different signs. Deviations from the optimal
solution can therefore partly compensate, improving the
variational solution.



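$$= e^{-\frac{\beta}{4}(\tilde\Delta_1 + \tilde\Delta_2)} \, 2\cosh\left( \frac{\beta}{4}(\tilde\Delta_2 - \tilde\Delta_1) \right)$$

$$\propto e^{-\frac{\beta}{4}(\bar\Delta_1 + \bar\Delta_2)} \, 2\cosh\left( \frac{\beta}{4}(\tilde\Delta_2 - \tilde\Delta_1) \right),$$

with parameter $\beta > 0$, $c_i$ real,

$$\Delta_D = \langle y - D | O_D | y - D \rangle,$$

$$\Delta_i = \langle y - T^i | O_{s,i} | y - T^i \rangle,$$

and for combining data and template term in the same exponent

$$\tilde\Delta_i = \bar\Delta_i + \sum_j \langle T^{i,j} | O^{i,j} | T^{i,j} \rangle - \langle \bar T^{o,i} | O^i | \bar T^{o,i} \rangle - 2c_i/\beta,$$

$$\bar\Delta_i = \langle y - \bar T^{o,i} | O^i | y - \bar T^{o,i} \rangle,$$

$$\bar T^{o,i} = (O^i)^{-1} \sum_j O^{i,j} T^{i,j},$$

$$T^{i,1} = D, \quad T^{1,2} = T^1, \quad T^{2,2} = T^2,$$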
and we will write 

rpi rpi , 1 , rpi , 2 

The data operator is diagonal in x-representation,

$$O_D = \lambda_D N_D,$$

and for the template operator $O_s$ we choose the same for $T^1$ and $T^2$, i.e. $O_{s,1} = O_{s,2}$. This gives for the operator

$$O^M = \frac{1}{Z} \sum_i e^{-\frac{\beta}{2}\tilde\Delta_i} \, O^i = \lambda_D N_D + O_s. \qquad (43)$$

Thus, the two terms of Eq. 34 corresponding to i = 1 and i = 2 coincide and the factor $Z = \sum_i Z_i$ cancels out.

We select an operator related to a smoothness measure,

$$O_s = \lambda_0 O_0 + \lambda_2 O_2 + \lambda_4 O_4,$$

with

$$O_0(x, x') = I(x, x') = \delta(x - x'),$$

$$O_2(x, x') = -\delta(x - x') \frac{d^2}{dx^2},$$

$$O_4(x, x') = \delta(x - x') \frac{d^4}{dx^4}.$$

The $\lambda_D$ and $\lambda_i$ allow changing the weight of the four parts. The inhomogeneous side has the form



$$T^M = \frac{1}{Z} \left( e^{-\frac{\beta}{2}(\Delta_D + \Delta_1) + c_1} (\lambda_D N_D D + O_s T^1) + e^{-\frac{\beta}{2}(\Delta_D + \Delta_2) + c_2} (\lambda_D N_D D + O_s T^2) \right), \qquad (44)$$

with $Z \propto p$. As the operators are the same for both $T^i$ they have equal normalization constants and the $c_i$ are directly the logarithms of the mixture coefficients. The proportionality factor $Z_{p^0}$ cancels out as well as the variance-like term arising from combining the data and template terms for equal x. For the model equation $O^M y = T^M$ we could, for example, write in the general case $O^1 \neq O^2$

$$\left[ \frac{O^1 + O^2}{2} + \tanh\left( \frac{\beta}{4}(\tilde\Delta_2 - \tilde\Delta_1) \right) \frac{O^1 - O^2}{2} \right] y = \frac{O^1 \bar T^{o,1} + O^2 \bar T^{o,2}}{2} + \tanh\left( \frac{\beta}{4}(\tilde\Delta_2 - \tilde\Delta_1) \right) \frac{O^1 \bar T^{o,1} - O^2 \bar T^{o,2}}{2}.$$

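However, we have already seen in Section 9.6.1 that in the case of equal operators $O^i$ the equation

$$(\lambda_D N_D + O_s) \, y = \lambda_D N_D D + \frac{Z_1}{Z} O_s T^1 + \frac{Z_2}{Z} O_s T^2$$

can be simplified to

$$y = \frac{Z_1}{Z} \bar T^{o,1} + \frac{Z_2}{Z} \bar T^{o,2}. \qquad (45)$$

For two templates this may also be written as

$$y = \bar T^{o} + \tanh\left( \frac{\beta}{4}(\tilde\Delta_2 - \tilde\Delta_1) \right) \frac{\bar T^{o,1} - \bar T^{o,2}}{2}.$$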

Eq. (45) shows that in this simplest case of only two templates with equal operators $O^i$ the space of solutions is effectively one-dimensional. It is the line spanned by convex combinations of the two $\bar T^{o,i}$. (The superscript o is kept as a reminder that an operator inversion $(O^i)^{-1}$ is needed to obtain this template average and that the i-dependence of the $T^i$ remains.) We also see that for $\beta = 0$, where the tanh is also zero, this gives the correct high temperature solution $y(\beta = 0) = \bar T^{o}$. For $\beta \to \infty$ the tanh becomes ±1 depending on the sign of $\tilde\Delta_2 - \tilde\Delta_1$. Hence, we correctly find as self-consistent low temperature solutions the component templates $\bar T^{o,1}$ and $\bar T^{o,2}$.

For the symmetric case with $\tilde\Delta_2 - \tilde\Delta_1 = \bar\Delta_2 - \bar\Delta_1$ one stationary solution in function space is easily found. Then the equation $y = \bar T^{o}$ for $\bar\Delta_2 - \bar\Delta_1 = 0$ is consistent with the definition of the template average, for which we found $\bar\Delta_1 = \bar\Delta_2$. We will see, however, that this solution of the stationarity equation is only a minimum for small enough $\beta$ (the "high temperature phase").
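The limits discussed above can be checked numerically in the simplest setting. The following is a minimal sketch, assuming a single x (scalar y), equal mixture coefficients ($c_i = 0$), and scalar weights `lam_d`, `lam_s` standing in for the data and template operators; the function name `mixture_fixed_point` and all parameter values are hypothetical choices for illustration. It iterates the fixed-point form of Eq. (45), in which y is a convex combination of the two combined templates.

```python
import math

def mixture_fixed_point(beta, data, templates, lam_d=1.0, lam_s=1.0,
                        y_start=0.0, iterations=500):
    """Iterate y <- sum_i (Z_i / Z) * tbar_i for a scalar two-template mixture."""
    # Combined templates tbar_i = (lam_d * D + lam_s * T_i) / (lam_d + lam_s).
    tbars = [(lam_d * data + lam_s * t) / (lam_d + lam_s) for t in templates]
    y = y_start
    for _ in range(iterations):
        # Exponents -beta/2 * (Delta_D + Delta_i), shifted by their maximum
        # for numerical stability; the shift cancels in Z_i / Z.
        exps = [-0.5 * beta * (lam_d * (y - data) ** 2 + lam_s * (y - t) ** 2)
                for t in templates]
        shift = max(exps)
        weights = [math.exp(e - shift) for e in exps]
        y = sum(w * tb for w, tb in zip(weights, tbars)) / sum(weights)
    return y, tbars

# High temperature (beta = 0): the template average of the combined templates.
y_hot, tbars = mixture_fixed_point(beta=0.0, data=0.3, templates=(1.0, -1.0))
# Low temperature: the iteration settles on one combined template.
y_cold, _ = mixture_fixed_point(beta=50.0, data=0.3, templates=(1.0, -1.0),
                                y_start=0.6)
print(tbars, y_hot, y_cold)
```

With these illustrative values the combined templates are 0.65 and -0.35, so the high temperature solution is their average and the low temperature run started near 0.65 stays at that component.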

Choosing only one x, i.e. |X| = 1, with $O_s = 1$ and $\bar T^{o,1} = T^1 = 1$, $\bar T^{o,2} = T^2 = -1$ the model equation reduces to the celebrated mean field equation of a ferromagnet with uniform couplings

$$y = \tanh(\beta y). \qquad (46)$$



Here the templates $T^i$ (representing prior knowledge) are the analogue of possible states 80 of a physical system (which also has to be specified a priori) 81 and the mean y represents in both cases the observable we are interested in. The data, which update our knowledge, can be called local fields, changing (correcting) a priori given templates $T^i$ to combined templates $\bar T^{o,i}$, i.e. the final (posterior) mixture components.

At $\beta = 1$ Eq. (46) shows a bifurcation phenomenon: for $\beta < 1$ there is only one solution y = 0, while for $\beta > 1$ two new solutions appear and y = 0 becomes unstable. If the equation is seen as a phenomenological description of a large (or infinite) system this is also called a phase transition. Indeed, such bifurcations or phase transitions are typical for mixture distributions. For example, magnetic systems have many connections to neural networks, especially to Hopfield nets, see e.g. Hertz, Krogh, & Palmer 1991. An interesting clustering algorithm showing these phase transitions can be found in Rose, Gurewitz, & Fox, 1990. For a real magnetic implementation see also Blatt, Wiseman, & Domany, 1997.
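The bifurcation of Eq. (46) can be made visible with a few lines of code. The sketch below is only an illustration: it uses plain fixed-point iteration (one of several possible solvers), and the function name `solve_mean_field`, the starting value and the iteration count are hypothetical choices.

```python
import math

def solve_mean_field(beta, y_start=0.5, iterations=2000):
    """Fixed-point iteration for the mean field equation y = tanh(beta * y)."""
    y = y_start
    for _ in range(iterations):
        y = math.tanh(beta * y)
    return y

# Below beta = 1 the iteration collapses to y = 0; above it a nonzero
# solution appears (and its mirror image for a negative starting value).
for beta in (0.5, 0.9, 1.5, 2.0):
    print(f"beta = {beta}: y = {solve_mean_field(beta):.4f}")
```

Starting from a negative value selects the mirror solution, which is the symmetric pair of new solutions referred to above.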

10.2.1 Some remarks on local templates 

Generalization requires correlations between different x, i.e. nonlocal dependencies. We implemented those dependencies by explicitly giving global templates. Alternatively, those dependencies can arise from coupled local templates $T_x^i$. Local templates means that we construct a model by adding mixture components for every possible local state $i_x$ for every single $x \in X$. Nonlocal dependencies are present if mixture terms for x depend also on $x' \neq x$. For the whole system the number of states grows exponentially with |X|. This leads for large |X| (especially in the limit of continuous x) to two obvious problems:



Not equal to the states f of possible y.
One may consider templates for physical systems (states)
to be more 'real' than templates originating from fuzzy im- 
plementation of prior knowledge. This means, however, only 
that for physical systems a lower temperature can be actually 
realized. With regard to 'fuzzy' templates 'nature' is usually 
in a state of higher temperature and corresponding low tem- 
perature states may not be preparable. However, in physical 
systems the energy function (i.e. states) is (depending on the 
level of description) not exactly known. Then different possi- 
bilities have to be combined by OR, giving mixture probabil- 
ities. Examples are spin glasses where the variables on which 
the log-likelihood depends are separated into states, and in- 
teractions. (They are for spin glasses, however, treated asym- 
metrically: marginalization over interactions not states). As 
long as the probability distribution over different interac- 
tions cannot be made deterministic and one does not restrict 
to non-fluctuating (self-averaging) observables (i.e. those for 
which averaging over interactions can be skipped) a spin glass 
cannot be prepared at ('knowledge') temperature zero, i.e. 
$\beta = \infty$, even if it is at a 'thermic temperature' (near to)
zero. (We have seen in Section 5.3.4 that one may define
many temperatures $\beta_j$, related to different parameters of the
generating process for f°, and if we distinguish a physical 
from a knowledge temperature this depends on what part of 
the process we label thermic. In principle we may substi- 
tute every process under the label 'physical' process.) Also 
for the 'thermic' temperature it is practically impossible to 
reach the absolute zero point. Hence, the difference between 
temperature ranges for 'fuzzy systems' and physical states is 
not of qualitative, but of quantitative nature. 



1. At low temperature the optimization can become 
extremely difficult or impossible when the number 
of local minima of p(f°) is too large to be consid- 
ered completely. 

2. At high temperature the generalization ability can be too small to be useful when the probability distribution p(f°) is too broad.

For a mixture model with only a few global templates p(f°) remains non-factorial in the high temperature limit. This means that some combinations of local states always remain excluded and generalization possibilities remain, at least for finite |X|, in this limit. (For infinite |X|, in contrast, one must require a remaining finite factor dimension of p(f°) to allow generalization with respect to data and relevant questions depending on a finite number of x. See Section 2.)

The Hopfield model, for example, is a special mixture model with quadratic components (usually combined with a special iteration dynamic). It is constructed with coupled local templates $T_x = \pm 1$, so its number $2^{|X|}$ of global templates $T^j_X$ grows exponentially with |X|. Typically the Hopfield net is used as associative memory. There one is interested in retrieving (a large number of) stored patterns by varying the starting point for the iteration procedure. 82 Even though this is optimally not done at zero temperature, where many unwanted mixture states are stable, its use as associative memory has the nature of a low temperature application, because the memories shall be retrieved as near as possible to the original stored pattern. Its use is limited by the onset of the spin glass phase. (See for example Amit, Gutfreund, & Sompolinsky, 1987; Amit, 1989.) In general, for a system with local templates the generalization possibility can break down completely in the high temperature limit, when all templates become equally likely and p(f°) becomes factorial (see Section 2).

For nonlinear regularization we have mainly fuzzy log- 
ical applications in mind where the number of fuzzy log- 
ical alternatives (templates) is not extremely large but 
comparable to the number of alternatives in typical prob- 
lems solved by logical methods. We are, however, espe- 
cially interested in the interpolation between templates, 
i.e. in the deformed solutions under given data fields at 
finite temperature. 

One can see the two-(or few-)template case as an effective model of an intermediate temperature range where two templates $T^1$ and $T^2$ are near their phase transition at $\beta^*$. Those two templates are then considered as high temperature averages of finer 'constituent' templates which at temperature $\beta^*$ cannot be distinguished, while templates not considered are treated in their low temperature approximation and are already excluded. Our Gaussian two template example then corresponds to a Gaussian approximation ('oscillators' for discrete x, 'free field' or random phase approximation



Hence, the iteration procedure corresponds in this case to retrieval and not to learning. Learning in the Hopfield net, i.e. finding the weights W so that the correct patterns (templates) are stable, corresponds on this level of comparison to the determination of priors p(f°).



for continuous x, however in a quite general form corre- 
sponding to the chosen O) for the two effective templates 
T 1 and T 2 . Thus, a mixture model with two global tem- 
plates, defined for all x, represents a system capable of 
two Gaussian (process) states. This system is at zero 
temperature in one of those two possible global states, 
and at nonzero temperature in a mixture (weighted OR 
for disjunct events) of those two states. 

For a model with global templates the local states $T_x$ for all x are already combined into global states $T^i$ by AND, i.e. by the sum inside the scalar product in the log-probability. The logarithm of the partition sum for N global templates

$$\ln Z = \ln \sum_i^N e^{-\beta E_i(X)} = \ln \sum_i^N \prod_x^{|X|} Z_{x,i}$$

is of the form $\mathrm{OR}_i \, \mathrm{AND}_x \, Z_{x,i}$, with $Z_{x,i}$ being the (effective) partition sum for a single x in the global template (state) i of the complete system. Notice that if the system includes nonlocal interactions (e.g. smoothness) then $Z_{x,i} = Z_{x,i}(X)$ depends also on $x' \neq x$ and Z does not factorize into local components depending only on single x.

Constructing a system not out of global states (templates $T^j$) but out of combinations of local states (templates $T_x$) for each x leads to

$$\ln Z = \sum_x^{|X|} \ln \sum_i^n Z_{x,i}(X) = \ln \prod_x^{|X|} \sum_i^n Z_{x,i}(X) = \ln \sum_{i_1}^n \cdots \sum_{i_{|X|}}^n \prod_x^{|X|} Z_{x,i_x}(X) = \ln \sum_{i'}^{n^{|X|}} \prod_x^{|X|} Z_{x,i'}(X)$$

with multi-index $i' = (i_1, \cdots, i_{|X|})$ and $Z_{x,i'}(X) = Z_{x,i_x}(X)$. This corresponds to $N = n^{|X|}$ global templates. The sum reduces if probabilities for certain combinations of local templates are zero. Otherwise one has n different disjunct local states i for every subsystem x and correspondingly $n^{|X|}$ templates for the whole system with, depending on the considered interaction and dynamic, as many potential candidates for (meta)stable states. Notice the similarity of this form of ln Z with the form obtained by averaging for spin glasses (see first footnote in Section 5.3.4). The sums over all configurations $x \in X$ of composite systems can sometimes be reduced to a product over single component sums (that means 'exchanging' $\sum$ with $\prod$ under the logarithm) at the cost of introducing new variables. Such an embedding of the system in a larger space can make it easier to solve (similar to the idea of introducing Lagrange parameters), or a certain approximation scheme (e.g. saddle point approximation) can become applicable. Quadratic interactions, for example, are linearized by the Hubbard-Stratonovich transformation, where the Gaussian integral formula is used 'backwards': $e^{b^2/4a} = \sqrt{a/\pi} \int d\mu \, e^{-a\mu^2 \pm b\mu}$. Then the total partition sum factorizes in x and the sum can be performed over local components. The remaining integral over $\mu$ (called order parameter) can be performed in saddle point approximation. Similarly, restrictions like $\delta$-functions or step functions $\Theta(x)$ can be written in an integral representation, which creates new order parameters for a subsequent saddle point approximation.
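The exchange of sum and product under the logarithm, and the resulting count of $n^{|X|}$ global templates, can be verified directly for a small system. The following sketch assumes fixed numerical values for the local partition sums (the table `z` is made up for illustration, and the function names are hypothetical); it compares the product of local sums with the explicit sum over all multi-indices.

```python
import itertools
import math

def product_of_local_sums(z):
    """prod_x sum_i z[x][i]: one small sum per site x."""
    return math.prod(sum(row) for row in z)

def sum_over_global_templates(z):
    """Sum over all multi-indices (i_1, ..., i_|X|) of prod_x z[x][i_x]."""
    index_ranges = [range(len(row)) for row in z]
    total = 0.0
    count = 0
    for multi_index in itertools.product(*index_ranges):
        total += math.prod(z[x][i] for x, i in enumerate(multi_index))
        count += 1
    return total, count

# Made-up local partition sums for |X| = 4 sites with n = 3 local states each.
z = [[0.2, 1.0, 0.5], [0.7, 0.1, 0.9], [1.3, 0.4, 0.6], [0.8, 0.8, 0.3]]
local = product_of_local_sums(z)
global_sum, n_templates = sum_over_global_templates(z)
print(n_templates, math.isclose(local, global_sum))
```

The explicit enumeration visits every global template once, which is why it becomes infeasible for large |X| while the factorized form stays cheap.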

For example, a ferromagnetic mean field equation with nonuniform coupling looks like $y = \tanh(\beta W y + \beta h_{ext})$, to be read component-wise, with an external field vector $h_{ext}$. The coupling matrix or operator W causes couplings between different x values of a vector y. To obtain such an equation one has to use local templates, independent for different x. A model log-probability would look like

$$\sum_x \ln \sum_{i=1}^{2} e^{\bar L_x + L_{x,i}(X)} = \sum_x \left( \bar L_x + \ln 2\cosh L_{x,1}(X) \right)$$

for $L_{x,1} = -L_{x,2}$ with, for $\bar L_x = 0$, a derivative of the form

$$\frac{\partial}{\partial y_x} \sum_{x'} \ln 2\cosh L_{x',1}(X) = \sum_{x'} \tanh\left( L_{x',1}(X) \right) \frac{\partial L_{x',1}(X)}{\partial y_x}.$$

Neglecting the non-diagonal terms gives equations of the structure of mean field equations with nonuniform coupling operator (e.g. nearest neighbors) W and external field $h_{ext}$ contained in $L_{x,i}$. For quadratic interactions such equations are usually obtained using the Hubbard-Stratonovich transformation (see below).

10.3 Landau-Ginzburg regularization

For an interaction version we can approximate a structure (D AND ($T^1$ OR $T^2$)) in a naive (fuzzy) implementation as

$$L^{I_1} = -g(\gamma_D \Delta_D + \gamma \Delta_1 \Delta_2),$$

or, similarly, use the structure (D AND $T^1$) OR (D AND $T^2$)

$$L^{I_2} = -g\left( (\gamma_D \Delta_D + \gamma \Delta_1)(\gamma_D \Delta_D + \gamma \Delta_2) \right) = -g\left( \gamma_D^2 \Delta_D^2 + \gamma_D \gamma \Delta_D (\Delta_1 + \Delta_2) + \gamma^2 \Delta_1 \Delta_2 \right).$$

These L have a polynomial structure. (The strictly monotonically increasing g does not change the location of extrema, even if non-polynomial.) The parameters $\gamma$, $\gamma_D$ parameterize the relative weights of data and template terms in the energy (log-probability, error) function. Because extrema of $L^{I_i}$ are independent of a scaling factor, we can always choose $\gamma_D = 1 - \gamma$. To have a parameter with values between zero and infinity, like $\beta$ in a mixture or finite temperature regularization, we can use

$$\beta^I = \frac{\gamma}{1 - \gamma}, \qquad \gamma = \frac{\beta^I}{1 + \beta^I},$$

so that for example after multiplying with $1 + \beta^I = 1/(1 - \gamma)$, skipping from now on the superscript I for $\beta^I$,

$$L^{I_1} = -g(\Delta_D + \beta \Delta_1 \Delta_2),$$

where $\beta$ as in $L^M$ can be seen (after multiplying again with $\beta$) as a common scaling factor for $\Delta_D$, $\Delta_1$, $\Delta_2$. Analogously, we get

$$L^{I_2} = -g(\Delta_D^2 + \beta \Delta_D (\Delta_1 + \Delta_2) + \beta^2 \Delta_1 \Delta_2).$$

Conversely, using for the temperature $\beta$ of the mixture model

$$\gamma^M = \frac{\beta}{1 + \beta}, \qquad \beta = \frac{\gamma^M}{1 - \gamma^M}$$

has the advantage that the infinite interval $[0, \infty]$ is mapped into the finite interval [0, 1].



Eqs. (39) give for model $L^{I_1}$,

$$O^{I_1} = O_D + \beta (\Delta_1 + \Delta_2) O_s, \qquad (47)$$

$$T^{I_1} = O_D D + \beta O_s (\Delta_2 T^1 + \Delta_1 T^2), \qquad (48)$$

and for model $L^{I_2}$,

$$O^{I_2} = (O_D + \beta O_s)\left( 2\Delta_D + \beta(\Delta_1 + \Delta_2) \right), \qquad (49)$$

$$T^{I_2} = \left( 2\Delta_D + \beta(\Delta_1 + \Delta_2) \right) O_D D + \beta O_s \left( (\Delta_D + \beta \Delta_2) T^1 + (\Delta_D + \beta \Delta_1) T^2 \right). \qquad (50)$$

Both models have in the no-data case, $\Delta_D = 0$, the solutions $y = T^1$ and $y = T^2$, while in the case of missing templates $L^{I_1}$ reduces to a Gaussian data model and the second model $L^{I_2}$ keeps a term quadratic in $\Delta_D$, which however is equivalent to a Gaussian using $\tilde g(x) = g(\sqrt{x})$. For $T^1 = T^2$, i.e. $\Delta_1 = \Delta_2 = \Delta_T$, $L^{I_1}$ is not equivalent to a Gaussian model in $\Delta_D + \Delta_T$ and does not treat data and templates symmetrically. Using $\sqrt{\Delta_1 \Delta_2}$ in $L^{I_1}$ would restore the Gaussian in the limit, but destroy the polynomial structure.

A high temperature expansion of the mixture model $L^M$ according to (9) gives

Here the no-template case gives a Gaussian in $\Delta_D$. For $T^1 = T^2$ and $O^1 = O^2$ the difference $(\Delta_1 - \Delta_2)$ is zero, so that $L^{HT} = c - \Delta_D - \Delta_T$, with $\Delta_T = \Delta_1 = \Delta_2$, which is symmetric between data and templates and Gaussian. For $O^1 = O^2 = O_s$ the terms quartic in y cancel in $(\Delta_1 - \Delta_2)^2$ and the $L^{HT}$ quadratic in y can only have one extremum.

For the high temperature approximation we find, after dividing by $\beta$

For $O^1 = O^2$ and $T^1 = T^2$ the high temperature equation becomes $\beta$-independent and, as we already saw, linear in y, so only one solution can exist. More generally, one sees that the temperature independent $\bar T^{o} = O^{-1} \sum_{i=1}^{2} O^i T^i$, for which $\Delta_1 = \Delta_2$, is a self-consistent solution of the high temperature equation. We have already seen that the template average $\bar T^{o}$ is also one mean field solution for finite temperature or mixture regularization.

We may think of a form similar to the high temperature expansion to obtain an effective Landau-Ginzburg log-likelihood which possesses both the high and low temperature limits of the mixture model. 83 Instead of implementing the OR and using a parameter $\gamma$ weighting the data against the template influence, a more temperature-like parameter should interpolate between an AND in the high temperature phase (corresponding to the first term in the high temperature expansion, which is according to Section 5.3.4 the first moment or average with respect to the mixture coefficients $a_i/Z_a$) and an OR for the low temperature limit. Thus, we can choose
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PF 



- 9l [A D + Al + A2 +/?A!A 2 



-02( 



Ai + A 2 



■/?AiA 2 ) 



= -g 3 (A + f3A 1 A 2 ) 

with $\Delta$ resulting from the combination of $\Delta_1 + \Delta_2$. The superscript in $L^{PF}$ refers to an interpretation of $y = \bar T^{o}$ as a (self-interacting) prior field, with $\Delta$ describing the propagation of an average field and the term $\Delta_1 \Delta_2$ the additional 'repulsive' self-interaction. Varying the interaction strength $\beta$ allows one to go from the pure average field ($\beta = 0$ or high temperature case) to the purely interacting field ($\beta = \infty$ or low temperature).
Here we have

show the correct high and low temperature behavior. For $\beta = 0 \Rightarrow y = \bar T^{o}$, and for $\beta \to \infty$ only the second terms remain, with self-consistent solutions $y = \bar T^{o,1}$ and $y = \bar T^{o,2}$. For $O^1 = O^2$, this reads, similar to the case of the mixture model,

$$y = \frac{\sum_i z_i \bar T^{o,i}}{\sum_i z_i}$$

with $z_i = \frac{1}{2} + \beta \Delta_j$ ($j \neq i$) replacing $Z_i = e^{-\frac{\beta}{2}\Delta_i}$ and Z. However, this equation has a special, usually unwanted feature. According to its symmetry under exchanging $\Delta_1 \leftrightarrow \Delta_2$, the solution has either $\Delta_1 = \Delta_2$, i.e. $y = \bar T^{o}$, or for a solution $y_1$ there exists another solution $y_2$ with exchanged $\Delta_i$. Thus, as soon as the solution $\bar T^{o}$ gets deformed, there always exist two of them with the same value of L, so there is no way to choose between them. In this sense the equation implements a model where for all given data (i.e. a posteriori) both components $\bar T^{o,i}$ are equally likely. One can, for example, replace the term $\beta \Delta_1 \Delta_2$ in $L^{PF}$ by $L^{I_1}$ or $L^{I_2}$ to 'enforce a decision'.

Because it has the simplest structure and at the same time shows the typical phase transition phenomenon, we choose for the following numerical study the model $L^{I_1}$ for comparison with the mixture model, also with $O_D = \lambda_D N_D$ and $O_s = \lambda_0 O_0 + \lambda_2 O_2 + \lambda_4 O_4$.

In the context of regularization we want to fix the low temperature solutions and find possible parameterizations for all $\beta$, without being only interested in the neighborhood of the phase transition. Hence, we use a form for L where the low temperature solutions, i.e. the templates (combined for every mixture component), can directly be read off, not however necessarily the corresponding critical $\beta^*$. Alternatively, we could also express L in terms of the reduced temperature $t = 1/\beta - 1/\beta^*$, and choose polynomial terms in y to produce a phase transition at t = 0. This is more natural when studying phase transitions. For fourth order polynomials it is easy to solve for the extrema and therefore to relate the two formulations.

10.4 Bifurcations, phase transitions: one-dimensional case

We have discussed the special two template case with $O^1 = O^2$. In that case the solutions of the mixture model $L^M$ are restricted to a one-dimensional line in the function space F° of $y_x$. Similarly, using a Landau-Ginzburg form for the log-likelihood, quartic effective interaction terms yield stationarity equations with at most two stable solutions. Hence, it may help in understanding the features of higher dimensional situations if we recall the well-known one-dimensional case. Therefore, we present some figures which give a visually oriented summary of the bifurcation/phase transition behaviour for the models $L^M$ and $L^{I_1}$ in one dimension. This can also be seen as an illustration of the discussion of the maximum posterior approximation in Sections 7 and 8.5.

Figs. 11 and 12 show that $L^M$ and $L^{I_1}$ indeed possess similar behavior, including a phase transition. There are, however, quantitative differences; in particular, the high and low temperature limits are not the same. At high temperature only one solution exists; when the temperature is decreased a second solution can become stable. However, except for the case a = 0, the high temperature solution follows under decreasing temperature the more probable and thus better solution. This exemplifies the principle of annealing techniques.

In contrast, varying the value of a, corresponding to data, instead of temperature leads to typical hysteresis effects. Then a quite unlikely solution can remain stable for a long time. This may be seen as a prototype for sequential updating or on-line learning methods.
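The appearance of a second stable solution can be reproduced with a simple grid search. The sketch below assumes the one-dimensional Landau-Ginzburg energy $(y - a)^2 + \beta (y - b)^2 (y - c)^2$ with b = -1, c = 1, as stated in the caption of Fig. 12; the function names, the grid range and the resolution are hypothetical choices, and a grid search is only a crude stand-in for solving the stationarity equation.

```python
def landau_ginzburg_energy(y, a, beta, b=-1.0, c=1.0):
    """E(y) = (y - a)^2 + beta * (y - b)^2 * (y - c)^2."""
    return (y - a) ** 2 + beta * (y - b) ** 2 * (y - c) ** 2

def local_minima(a, beta, lo=-2.0, hi=2.0, steps=4001):
    """Return grid points whose energy lies strictly below both neighbours."""
    h = (hi - lo) / (steps - 1)
    ys = [lo + k * h for k in range(steps)]
    es = [landau_ginzburg_energy(y, a, beta) for y in ys]
    return [ys[k] for k in range(1, steps - 1)
            if es[k] < es[k - 1] and es[k] < es[k + 1]]

# For a = 0 a single minimum at y = 0 splits into two symmetric minima
# once beta is large enough.
for beta in (0.25, 2.0):
    print(beta, [round(y, 3) for y in local_minima(a=0.0, beta=beta)])
```

Repeating the scan while slowly changing a at fixed large $\beta$, and following the minimum nearest to the previous one, would be one way to reproduce the hysteresis effect described above.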

Because $\gamma = \frac{\beta}{1 + \beta}$ and not $\beta$ is the convex mixing coefficient between data and template terms in $L^{I_1}$, Figs. 13 and 14 show the one-dimensional case parameterized with $\gamma$. One sees that it can make a big difference whether equally spaced values of $\beta$ or of $\gamma$ are tested, for example in cross-validation. The parameter $\gamma$ has the advantage of lying completely in the interval [0, 1].
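The difference between the two parameterizations is easy to tabulate. The following sketch implements the map $\gamma = \beta/(1 + \beta)$ and its inverse; the function names and the two example grids are hypothetical and only meant to show how unevenly one grid is spread when expressed in the other parameter.

```python
def gamma_from_beta(beta):
    """Map beta in [0, inf) to gamma = beta / (1 + beta) in [0, 1)."""
    return beta / (1.0 + beta)

def beta_from_gamma(gamma):
    """Inverse map beta = gamma / (1 - gamma), defined for gamma < 1."""
    return gamma / (1.0 - gamma)

# Equally spaced gamma values are stretched strongly towards large beta ...
gammas = [0.1 * k for k in range(10)]        # 0.0, 0.1, ..., 0.9
betas_from_gammas = [beta_from_gamma(g) for g in gammas]

# ... while equally spaced beta values crowd towards gamma = 1.
betas = [float(k) for k in range(10)]        # 0, 1, ..., 9
gammas_from_betas = [gamma_from_beta(b) for b in betas]

for g, b in zip(gammas, betas_from_gammas):
    print(f"gamma = {g:.1f}  ->  beta = {b:.3f}")
```

A cross-validation search over ten equally spaced $\beta$ values therefore probes a very different set of models than a search over ten equally spaced $\gamma$ values.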

Figs. 15 and 16 compare the saddle point approximation with the full Bayesian approach depending on the distance of the data a to the two templates $T^i = \pm 1$. The distances |a − 1|/2 and |a + 1|/2 correspond to the relative canonical distance $d_{O^i}(y, T^i)$ and the value zero to the high temperature template average $\bar T^{o}$. Clearly, for a = 0 the full Bayesian solution remains zero at all temperatures. For higher a the mean field solution quickly becomes better. The structure, which is more pronounced than that of the full risk, shows the low temperature character of the MaP approximation, i.e. its tendency towards single templates instead of their template average. Notice, however, that the mean field solution, despite resulting from an expansion in $1/\beta$, is not worst at high temperatures but in the neighborhood of the phase transition. The high temperature limit of the saddle point approximation coincides, in the shown case of an approximation problem, with the exact solution again. This is the case because for a Gaussian distribution mean and mode coincide.

Figure 11: One-dimensional Gaussian mixture model: $N(a, 1/\sqrt{\beta}) N(b, 1/\sqrt{\beta}) + N(a, 1/\sqrt{\beta}) N(c, 1/\sqrt{\beta})$, at b = −1, c = 1, with $N(\mu, \sigma)$ denoting a Gaussian with mean $\mu$ and variance $\sigma^2$. We denote the independent variable by y to relate the example to the Bayesian framework. The variable a is meant to represent data values, b and c templates. Rows 1-3 (from top) left: a = 0, 0.05, 0.5; right: cut at $\beta$ = 2 (10 times f) and $\beta$ = 4; row 4, left: $\beta$ vs. $y_{min}$ for −2.25 < a < 2.25 by 19 steps of 0.25 (thick line: a = 0.5); right: function at a = 0.10 for 0.5 < $\beta$ < 10; row 5, left: $\beta$ = 6, −1 < a < 1; right: $\beta$ = 6, a = −0.6 (thick), 0, 0.6 (thin); row 6, left: a vs. $y_{min}$ for $\beta$ = 0, 1, 2 (thick), 4, 6 (dashed), 8, 10; right: function at $\beta$ = 6 for −1 < a < 1.

Figure 12: One-dimensional Landau-Ginzburg regularization with a product term as effective interaction, representing a version of 'Fuzzy OR': $(y - a)^2 + \beta (y - b)^2 (y - c)^2$ at b = −1, c = 1. Rows 1-3 (from top) left: a = 0, 0.5, 1; right: cut at 0.5 and 2.0; row 4, left: $\beta$ vs. $y_{min}$, −1.5 < a < 1.5 by 13 steps of 0.25 (thick line: a = 0.5); right: function at a = 0.5 for 0 < $\beta$ < 3; row 5, left: −1 < a < 1, $\beta$ = 1.2; right: cuts at $\beta$ = 1.2 for a = −0.6 (thick), 0.0, 0.6 (thin); row 6, left: a vs. $y_{min}$ for $\beta$ = 0, 0.5 (thick), 1.0, 1.2 (dashed), 1.5, 2.0, 2.5, 3.0; right: function at $\beta$ = 1.2, −8 < a < 8.

Figure 13: One-dimensional Gaussian mixture model, parameterized by $\gamma = \frac{\beta}{1 + \beta}$ taking values in [0, 1], corresponding to $\beta = \frac{\gamma}{1 - \gamma}$. Rows 1-3 (from top) left: a = 0, 0.5, 1; right: cut at $\gamma$ = 2/3 and 0.85; row 4, left: $\gamma$ vs. $y_{min}$, −1.5 < a < 1.5 by 13 steps of 0.25 (thick line: a = 0.5); right: function at a = 0.1 for 0.12 < $\gamma$ < 0.4; row 5, left: −1 < a < 1, $\gamma$ = 0.15; right: cuts at $\gamma$ = 0.15 for a = −0.5 (thick), 0.0, 0.5 (thin); row 6, left: a vs. $y_{min}$ for 0 < $\gamma$ < 1 by steps of 0.2 (thick: $\gamma$ = 2/3, $\gamma$ = 0.85); right: function at $\gamma$ = 0.85, −1 < a < 1.

Figure 14: One-dimensional Landau-Ginzburg regularization with a 'Fuzzy OR' in its 'natural' convex parameterization $\gamma = \frac{\beta}{1 + \beta}$ taking values in [0, 1], so that $\beta = \frac{\gamma}{1 - \gamma}$. Rows 1-3 (from top) left: a = 0, 0.5, 1; right: cut at $\gamma$ = 1/3 and 2/3; row 4, left: $\gamma$ vs. $y_{min}$, −1.5 < a < 1.5 by 13 steps of 0.25 (thick line: a = 0.5); right: function at a = 0.5 for 0 < $\gamma$ < 1; row 5, left: −1 < a < 1, $\gamma$ = 0.5; right: cuts at $\gamma$ = 0.5 for a = −0.5 (thick), 0.0, 0.5 (thin); row 6, left: a vs. $y_{min}$ for $\gamma$ = 0, 0.25, 1/6 (thick), 0.5 (dashed), 0.75, 1; right: function at $\gamma$ = 0.5, −1 < a < 1.
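The low temperature character of the maximum posterior solution can be illustrated by comparing the mode and the mean of the one-dimensional mixture of Fig. 11. The sketch below is an assumption-laden stand-in, not the full Bayesian risk of Figs. 15 and 16: it takes the unnormalized density $e^{-\frac{\beta}{2}[(y-a)^2 + (y-b)^2]} + e^{-\frac{\beta}{2}[(y-a)^2 + (y-c)^2]}$ with b = -1, c = 1, and uses the mean of this density as a simple proxy for an averaged solution. Function names, grid range and resolution are hypothetical.

```python
import math

def unnormalized_posterior(y, a, beta, b=-1.0, c=1.0):
    """exp(-beta/2 [(y-a)^2 + (y-b)^2]) + exp(-beta/2 [(y-a)^2 + (y-c)^2])."""
    data = (y - a) ** 2
    return (math.exp(-0.5 * beta * (data + (y - b) ** 2))
            + math.exp(-0.5 * beta * (data + (y - c) ** 2)))

def mode_and_mean(a, beta, lo=-6.0, hi=6.0, steps=12001):
    """Grid estimates of the maximizer (mode) and of the mean of the density."""
    h = (hi - lo) / (steps - 1)
    ys = [lo + k * h for k in range(steps)]
    ps = [unnormalized_posterior(y, a, beta) for y in ys]
    mode = ys[max(range(steps), key=ps.__getitem__)]
    mean = sum(y * p for y, p in zip(ys, ps)) / sum(ps)
    return mode, mean

# At high temperature mode and mean nearly agree; at low temperature the
# mode jumps towards a single component while the mean stays in between.
for beta in (0.5, 6.0):
    mode, mean = mode_and_mean(a=0.1, beta=beta)
    print(f"beta = {beta}: mode = {mode:.3f}, mean = {mean:.3f}")
```

For a = 0 the density is symmetric, so its mean stays at zero for every $\beta$, while the mode splits once $\beta$ is large enough.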



[Figures 15 and 16, plot panels: argmax p(f°|f); argmax p(y|f); r(f, f); argmin r(f, f); −p(y|f) ln p(y|f = 0); all three.]

Figure 15: Mean field (maximum posterior approximation or empirical risk minimization) vs. full Bayesian approach for a Gaussian mixture model at a = 0.1. Row 1: The posterior probability p(f°|f) used for the MaP step (left), and the corresponding optimal MaP solution $y^* = f^{0,*} = \mathrm{argmax}_{f^0 \in F^0} \, p(f^0|f)$ (right). Row 2: The true effective probability $p(y|f) = \int df^0 \, p(f^0|f) \, p(y|f^0)$ (left) and its maximizer $\mathrm{argmax}_y \, p(y|f)$ (right). Row 3: The full Bayesian risk for the corresponding approximation problem (so we can identify f° and f), $r(f, f) = -\int df^0 \int dy \, p(f^0|f) \, p(y|f^0) \ln p(y|f)$ (left), and its minimizer $f^* = \mathrm{argmin}_{f \in F} \, r(f, f)$ (right). Row 4: All three curves combined for comparison (right) and, on the left, the actual loss distribution $-p(y|f) \ln p(y|f)$ for the example f = y = 0. We may remark that even in this case, where the mean field approximation is not a good approximation for the true $\beta$-dependency of the risk because of the small a (for a = 0 the true optimal solution would always be zero), it is nevertheless possible to obtain the whole range by adapting the 'mean field temperature', for example by cross-validation. Notice that neither the linear high temperature regularization nor the two linear low temperature limits can access the whole range, as they can never cross the value a by changing their $\beta$.



Figure 16: Mean field (maximum posterior) vs. full Bayesian approach at a = 0.5. The same situation as in Fig. 15 with a nearer to the template at 1, so the mean field approximation becomes better. (For still larger a (not shown) the mean field approximation improves quickly.) In both figures the low temperature character of the maximum posterior method is nicely seen. The posterior probability is much more sharply peaked than the true risk, therefore amplifying differences between alternative f°. The true risk, containing two integrations, is much smoother. The fact that the maximum of the true y distribution p(y|f) in state f does not coincide with the optimal f shows the asymmetry of this distribution. Obviously, the mean field approximation is much better for larger a. For a non-approximation problem the risk minimization under $f^{0,*}$ would have to be included in the MaP-MiR procedure. The results depend on the chosen non-approximation loss. One may remark here that in situations where the template represents a prototypical situation for which actions are available and cheap, it is reasonable to add a loss term increasing with the distance from the nearest template. Including such a 'template-distance' loss favors a low temperature approximation and improves the validity of the mean field solution.

Figure 17: The two template example. The upper left diagram shows the two templates $T^1$ and $T^2$ and data (drawn from the interval [1, 30]). The upper right diagram shows the state of nature f° (thickly dashed) and $T^1$. The second row from above shows the two ($\lambda_D$-, $\lambda_0$-dependent) $\bar T^{o,i}_{\lambda_D, \lambda_0}$ which are the solutions for either $T^1$ or $T^2$ combined with the data D under vanishing smoothness coefficients $\lambda_2 = \lambda_4 = 0$. They are given in the following figures as reference to estimate the effect of smoothing. (Left: for $T^1$, right: for $T^2$.)



10.5 Numerical results 

As examples of templates we choose (see the two dashed curves in the upper left picture in Fig. 17)

$$T^1 = \sin\left( \frac{3\pi (x - 1)}{m - 1} \right) - a_1,$$

$$T^2 = \sin^2\left( \frac{3\pi (x - 1)}{m - 1} \right) - a_2,$$

with m = 40 and $a_i$ adjusted so that both functions have mean zero on the interval [1, 40]. We consider the case that the learner expects the actual function to be similar, but not identical, to either $T^1$ OR $T^2$. Thus, the templates represent function prototypes. They may stand for two typical structures for a time series or, in the case of an incomplete (here one-dimensional) image to be reconstructed, for two expected spatial patterns.
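The two templates can be generated as follows. This is a sketch under explicit assumptions: the templates are taken to be a sine and a squared sine with argument $3\pi(x - 1)/(m - 1)$, x is discretized on the integer grid 1, ..., m, and the offsets $a_i$ are taken to be the grid means; the function name `make_templates` is hypothetical.

```python
import math

def make_templates(m=40):
    """Return the grid x = 1..m and two templates shifted to zero grid mean."""
    xs = list(range(1, m + 1))
    phase = [3.0 * math.pi * (x - 1) / (m - 1) for x in xs]
    raw1 = [math.sin(p) for p in phase]
    raw2 = [math.sin(p) ** 2 for p in phase]
    # Offsets a_i chosen (by assumption) as the means over the grid points.
    a1 = sum(raw1) / m
    a2 = sum(raw2) / m
    t1 = [v - a1 for v in raw1]
    t2 = [v - a2 for v in raw2]
    return xs, t1, t2, a1, a2

xs, t1, t2, a1, a2 = make_templates()
print(f"a1 = {a1:.4f}, a2 = {a2:.4f}")
print(f"grid means: {sum(t1) / len(t1):.1e}, {sum(t2) / len(t2):.1e}")
```

If the mean is instead meant as an integral over the continuous interval [1, 40], the offsets would differ slightly from these grid means.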

A mixture model can easily be realized by a hierarchical sampling process. First, a mixture component is chosen, corresponding to one of the templates and representing disjunct events. In a second step the actual f° is generated from that mixture component. An interaction model may be sampled by Monte Carlo methods. We do not intend to generate f° exactly according to one of the learning models. Instead we generate the state of nature f° by a different hierarchical process. Specifically, we use the following method to generate f°: First, we choose one $T^i$ ($T^1$ in the examples below, which, however, is assumed not to be known by the learner) and add Gaussian noise (with $\sigma = 0.2$) for every x. In a second step this wiggly function is smoothed by




Figure 18: Mixture model at $\beta = 1 < \beta^*$. ($L_M$,
relaxation, $\eta = 1.0$, at $\lambda_D = 100$, $\lambda_0 = 1$,
$\lambda_2 = 1$, $\lambda_4 = 1$.) The first three rows, here and
in the following figures, show the results of relaxation learning
for the mixture model $L_M$ according to (43, 44) with
$A = O$ for different starting configurations $y^0$, which
are (top to bottom) $y^0_1 = T_1$, $y^0_2 = T_2$, and a random
$y^0_r = T_{\mathrm{random}}$. The last two rows show for comparison
two one-template models with the same choice of parameters
and starting point $y^0 = T$: Row 4: a mixture
template $T = \bar T = (T_1 + T_2)/2$, row 5: the usual zero
template $T = T_0 = 0$ of homogeneous linear regularization.
Here and in the corresponding following figures,
the diagrams on the right show the evolution of the solution
$y$ during iteration, and the diagrams on the left the
final solutions (thick line). For comparison, also shown
are data (points) and the two templates
$\bar T^{0,i}_{\lambda_2=\lambda_4=0}$ (see Fig. 17) for the given
$\lambda_D$, $\lambda_0$. The bars show the (mean
square) generalization error (gray) calculated for 1000
newly generated random points in the intervals [1,40]
(left) or [1,30] (right), respectively, and, here and in the
following, always normalized with respect to the largest
of the errors for all five cases ($L_M$ with $y^0 = T_1$, $T_2$,
$T_{\mathrm{random}}$, and linear regularization with $\bar T$, $T_0$)
under the same parameter combinations. The black part represents
the minimal possible generalization error, with absolute
value always equal to 0.04.
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Figure 19: Smoother model, a low temperature case
$\beta = 1 > \beta^*$. (Mixture model $L_M$, relaxation, $\eta = 1.0$, 50
iterations, $\lambda_D = 10$, $\lambda_0 = 1$, $\lambda_2 = a^2$,
$\lambda_4 = a^4$, with $a = (m-1)/(3\pi)$, $m = 40$, bringing in this
case all derivatives to the same order of magnitude.) Rows 1-3:
mixture model, starting configurations $T_1$, $T_2$,
$T_{\mathrm{random}}$ (top to bottom).
Rows 4 and 5: one-template models with $\bar T = (T_1 + T_2)/2$
(row 4) and $T_0 = 0$ (row 5). Because one-template
models result, for the relaxation method with $\eta = 1$, in
linear equations, only one iteration is needed to obtain
the final solution. For $\eta = 1$ the number of iteration
steps needed to converge to the final solution can be
seen as a measure of the 'nonlinearity' of the equations.
For example, the left hand side figures (rows 1-3) show
that after only one iteration the solutions hardly change
anymore and thus the equations of the mixture model
in this parameter range (in contrast to other situations)
are nearly linear.




Figure 20: Landau-Ginzburg Regularization. ($L_{II}$,
relaxation, $\eta = 1.0$, at $\beta = 1$, $\lambda_D = 100$, and
$\lambda_0 = 1$, $\lambda_2 = 1$, $\lambda_4 = 1$.) As for the mixture
model, rows 1-3 show the solutions evolving from the different
starting configurations $T_1$, $T_2$, $T_{\mathrm{random}}$. Shown is
a high temperature case where only one solution is stable. The figure
shows clearly how the nonlinearities of the mean field
equation force the solution $y_2$ evolving from $T_2$ (and
$T_{\mathrm{random}}$) towards the solution $y_1$ evolving from $T_1$
(rows 2, 3). For this solution (row 1), being already near the
extremum, the nonlinearities are not effective. One sees
that the one-template models $T_0$, $\bar T$ are also nonlinear.
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Figure 21: Bifurcation: un- and metastable states. 
(Mixture model $L_M$, relaxation, $\eta = 1$, $\lambda_D = 1$,
$\lambda_0 = 1$, $\lambda_2 = 1$, $\lambda_4 = 1$,
$\beta = 1$, 0.482, 0.1, 0.01, top to bottom.)
The weaker solution evolving from $T_2$ changes suddenly
with $\beta$, with a vanishing gradient at $\beta = \beta^*$. This
solution appears as a 'shadow' in the iteration picture, and
looks stable under a smaller number of iterations. Near the
phase transition the solution is strongly adapted to
the data and quite different from its starting point $T_2$.



Figure 22: Bifurcation: The stable state. (Mixture
model $L_M$, relaxation, $\eta = 1$, $\lambda_D = 1$, $\lambda_0 = 1$,
$\lambda_2 = 1$, $\lambda_4 = 1$, $\beta = 1$, 0.482, 0.1, 0.01, top
to bottom.) The better solution near $T_1$ remains nearly unchanged.
Even near the phase transition it is still quite similar to its
starting point $T_1$, as the data do not require as strong an
adaptation as for the weaker solution. The deformation
becomes larger in the high temperature limit.
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Figure 23: Phase transition for a smooth model. (Mixture
model $L_M$, relaxation, $\eta = 1.0$, 50 iterations, at
$\beta = 0.0105 \approx \beta^*$, $\lambda_D = 10$, $\lambda_0 = 1$,
$\lambda_2 = a^2$, $\lambda_4 = a^4$.)
Row 1, left: Shown are the two low temperature solutions
$\bar T^{0,1}$, $\bar T^{0,2}$ (dashed), and the high temperature
limit $\bar T^{0}$ (dot-dashed) in the middle. (Which in this
case is similar, but not identical, to $\bar T = (T_1 + T_2)/2$.)
Row 1, right: $\bar T^{0,2}$ (thickly dashed) is shown, resulting
from the data points and $T_2$ (thinly dashed). Rows 2-4: (Starting
configurations $T_1$, $T_2$, $T_{\mathrm{random}}$.) The figure shows
that the solution evolving from $T_2$ (and in this case also
from $T_{\mathrm{random}}$) is nearly stable. The 'shadow' in the right
hand figure still shows the corresponding low temperature
solution (compare Fig. 19). The number of iterations
needed reflects the high nonlinearity of the mean-field
equations at this point. The (linear) one-template
models are temperature independent and not shown here
(but in Fig. 19). The two plots in row 5 are explained in
the caption of Fig. 24.




Figure 24: Phase transition for a more data oriented
model. (Mixture model $L_M$, relaxation, $\eta = 1.0$, 50 iterations,
at $\beta = 4.85 \approx \beta^*$, $\lambda_D = 100$, $\lambda_0 = 1$,
$\lambda_2 = 1$, $\lambda_4 = 1$.) Plots in rows 1-4 correspond to
Fig. 23. One sees clearly that the transition is much sharper, as the
higher data orientation of the coefficients favors the solution
evolving from $T_1$. Indeed, the solution evolving
from $T_2$ seems perfectly stable before its sudden transition.
Row 5, left: Shown are the normalized canonical
distances (see Eqs. (38, 37))
$d_1 = d_{O,T}(y - \bar T^{0,1})$ (thick),
$d_2 = d_{O,T}(y - \bar T^{0,2})$ (dashed),
$d_{HT} = d_{O,T}(y - \bar T^{0})$ (dot-dashed),
for $y = y_2$ with starting configuration $y^0_2 = T_2$
during iteration. Points with $d_1 + d_2 = 1$ are, according
to the triangle equality, exactly on the line spanned by
convex combinations of the two low temperature states,
which are solutions of the corresponding limiting linear
regularization problems. One sees that the final solutions
are on this line, but not the starting configurations,
which may be anywhere in the high-dimensional space
$F^0$. Notice also that $y$ iterates along this line and passes
the possible solutions in between. This can be used as a
sanity check for numerical calculations. Additionally restricted
$y$, e.g. with periodic boundary conditions, do
not in general satisfy $d_1 + d_2 = 1$. Right: The second plot in
row 5 shows the increasing $L_M(y)$ (unnormalized).
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Figure 25: The diagram summarizes schematically the 
two template example in this Section and the temper- 
ature dependence of its solutions y. Variations of high 
and low temperature solutions under changing parame- 
ters are shown in Figs. 26, 27. 
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Figure 26: High temperature solutions $\bar T^{0}$. (Mixture
model $L_M$, relaxation, $\eta = 1.0$.) The high temperature
solutions calculated at $\beta = 0.001$ and (top left:
small data and large smoothness influence) $\lambda_D = 1$,
$\lambda_0 = 1$, $\lambda_2 = a^2$, $\lambda_4 = a^4$; (top right:
small data and no smoothness influence) $\lambda_D = 1$,
$\lambda_0 = 10$, $\lambda_2 = 0$, $\lambda_4 = 0$; (bottom left:
large data and smoothness influence) $\lambda_D = 1000$,
$\lambda_0 = 1$, $\lambda_2 = a^2$, $\lambda_4 = a^4$; (bottom right:
large data and no smoothness influence) $\lambda_D = 1000$,
$\lambda_0 = 1$, $\lambda_2 = 0$, $\lambda_4 = 0$.
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[Figure panel titles: low data weight, nonsmoothed; high data weight, nonsmoothed]
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Figure 27: Low temperature solutions $\bar T^{0,i}$. (Mixture
model $L_M$, relaxation, $\eta = 1.0$.) The low temperature
solutions, calculated $\beta$-independently for the three
one-template models with $T_1$, $T_2$, and $T_0$ (from top to
bottom, with error bars relative within the same parameter
values), for the four cases corresponding to Fig. 26: (top
left) $\lambda_D = 1$, $\lambda_0 = 1$, $\lambda_2 = a^2$,
$\lambda_4 = a^4$; (top right) $\lambda_D = 1$, $\lambda_0 = 10$,
$\lambda_2 = 0$, $\lambda_4 = 0$; (bottom left) $\lambda_D = 1000$,
$\lambda_0 = 1$, $\lambda_2 = a^2$, $\lambda_4 = a^4$; (bottom right)
$\lambda_D = 1000$, $\lambda_0 = 1$, $\lambda_2 = 0$, $\lambda_4 = 0$.
One can observe that solutions evolving from better fitting
templates are able to produce smoother functions $y$.






changing $f^0$ randomly (zero mean Gaussian mutations
with $\sigma = 0.02$) at all locations and accepting the change
if the smoothness increases. Smoothness is hereby measured
by $\langle y^0 | O_s | y^0 \rangle$ with
$\lambda_0^{gen} = 0.0$ (i.e. only derivatives of $T_1$ contribute),
$\lambda_2^{gen} = 0.5 a^2 \approx 8.56$ and
$\lambda_4^{gen} = 0.5 a^4 \approx 146.6$, with
$a = (m-1)/(3\pi) \approx 4.138$,
so the derivatives have the same order of magnitude. (In
the following the learning models do not have the same
coefficients as the generation model, i.e.
$\lambda_i \neq \lambda_i^{gen}$.) This
smoothing process has been iterated 2000 times. Then
data are drawn from $f^0$ with a Gaussian distribution
with $\sigma = 0.2$ and mean $y_x(f^0)$ from the interval [1,30].
Thus, the task can be seen as a simple two-template
prediction or reconstruction problem, with the interval
[31,40] representing either future values (for time
series) or a hidden area (in image reconstruction). See
the thickly dashed curve in the upper right picture in
Fig. 17 for the $f^0$ used for the results discussed in the
following.
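
The generation procedure described above can be sketched as
follows. The sketch assumes an integer grid $x = 1, \ldots, 40$, a
sine-shaped $T_1$, a finite-difference form of the smoothness
measure applied to the function itself, and 20 data points. The
grid, the discretization, the number of data points and all
function names are illustrative assumptions, not details given in
the text.

```python
import numpy as np

M = 40                                        # grid x = 1, ..., M (assumed discretization)
A = (M - 1) / (3 * np.pi)                     # scale a, approximately 4.138
LAM2_GEN, LAM4_GEN = 0.5 * A**2, 0.5 * A**4   # approximately 8.56 and 146.6


def smoothness_energy(f, lam2=LAM2_GEN, lam4=LAM4_GEN):
    """Finite-difference stand-in for <f|O_s|f> with lambda_0 = 0 (lower = smoother)."""
    return lam2 * np.sum(np.diff(f, n=1) ** 2) + lam4 * np.sum(np.diff(f, n=2) ** 2)


def smooth_by_mutation(f, rng, mutation_sigma=0.02, n_steps=2000):
    """Mutate all locations at once; keep a mutation only if it lowers the energy."""
    f = f.copy()
    energy = smoothness_energy(f)
    for _ in range(n_steps):
        candidate = f + rng.normal(0.0, mutation_sigma, size=f.shape)
        candidate_energy = smoothness_energy(candidate)
        if candidate_energy < energy:
            f, energy = candidate, candidate_energy
    return f


def draw_data(f, rng, n_points, x_max=30, noise_sigma=0.2):
    """Sample noisy function values at random grid locations x in [1, x_max]."""
    idx = rng.integers(0, x_max, size=n_points)   # index k corresponds to x = k + 1
    return idx + 1, f[idx] + rng.normal(0.0, noise_sigma, size=n_points)


rng = np.random.default_rng(0)
x = np.arange(1, M + 1)
phase = 3 * np.pi * (x - 1) / (M - 1)
t1 = np.sin(phase) - np.mean(np.sin(phase))       # assumed sine shape, centred on the grid
noisy = t1 + rng.normal(0.0, 0.2, size=M)         # step 1: template plus Gaussian noise
f0 = smooth_by_mutation(noisy, rng)               # step 2: 2000 mutation proposals
x_data, y_data = draw_data(f0, rng, n_points=20)  # 20 data points is an arbitrary choice
```

Because a proposal is kept only when the energy decreases, the
result is at least as smooth as the noisy starting function, while
the small mutation width keeps it close to the chosen template.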

Figs. 18 - 20 present numerical results for the two- 
template example and the two prototypical nonlinear 
regularization methods: 

1. the finite temperature regularization with mixture
model

$$L_M \propto e^{-\frac{\beta}{2}(\Lambda_D + \Lambda_1)}
            + e^{-\frac{\beta}{2}(\Lambda_D + \Lambda_2)},$$

and stationarity equations (43, 44)

2. the Landau-Ginzburg regularization with an interaction
model

$$L_{II} = \frac{\beta}{2}\left(\gamma_D \Lambda_D + \gamma\, \Lambda_1 \Lambda_2\right)$$

in a naive fuzzy version and stationarity equations
(47, 48)

The two iteration schemes from the spectrum of learning
algorithms we used are

A. relaxation with $A = O$,

$$y^{i+1} = (1-\eta)\, y^i + \eta\, O^{-1} t,$$

B. and the gradient algorithm,

$$y^{i+1} = y^i - \eta\, (O y^i - t).$$



Intermediate algorithms can for example invert lower di- 
mensional sub-blocks of O . Such a submatrix of O can be 
constructed by including from clusters of correlated vari- 
ables one or a few (prototypical) representatives. Multi- 
grid methods, for example, can be seen as such an ap- 
proach for functions with approximately homogeneous 
local correlations. Note that in our case the EM algo- 
rithm coincides with the relaxation algorithm. Fig. 29 
shows typical problems of the gradient method for (dis- 
cretized) differential operators. 
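
To illustrate the difference between the two schemes, the
following sketch applies the general update
$y^{i+1} = y^i - \eta A^{-1}(O y^i - t)$ to a linear one-template
problem with the zero template. The finite-difference operator,
the noise-free sine data on $x = 1, \ldots, 30$ and the function
names are assumptions made for this example; the coefficients and
the gradient step size are borrowed from the caption of Fig. 29.

```python
import numpy as np


def build_operator(n, data_idx, lam_d, lam0, lam2, lam4):
    """O = data term + lam0*I + lam2*D1^T D1 + lam4*D2^T D2 on an n-point grid."""
    d1 = np.diff(np.eye(n), n=1, axis=0)          # first differences
    d2 = np.diff(np.eye(n), n=2, axis=0)          # second differences
    O = lam0 * np.eye(n) + lam2 * d1.T @ d1 + lam4 * d2.T @ d2
    np.add.at(O, (data_idx, data_idx), lam_d)     # diagonal data terms
    return O


def split_iteration(O, t, y0, eta, A_inv, n_iter):
    """Iterate y <- y - eta * A_inv (O y - t)."""
    y = y0.copy()
    for _ in range(n_iter):
        y = y - eta * (A_inv @ (O @ y - t))
    return y


n = 40
data_idx = np.arange(30)                          # observed locations x = 1, ..., 30
data_val = np.sin(3 * np.pi * data_idx / (n - 1))
O = build_operator(n, data_idx, lam_d=100.0, lam0=1.0, lam2=1.0, lam4=1.0)
t = np.zeros(n)
t[data_idx] = 100.0 * data_val                    # zero template: only the data enter t

y_exact = np.linalg.solve(O, t)
y0 = np.zeros(n)
# A. relaxation: A = O, one step with eta = 1 solves the linear problem
y_relax = split_iteration(O, t, y0, eta=1.0, A_inv=np.linalg.inv(O), n_iter=1)
# B. gradient: A = identity, ten small steps
y_grad = split_iteration(O, t, y0, eta=0.005, A_inv=np.eye(n), n_iter=10)

err_relax = np.linalg.norm(y_relax - y_exact)
err_grad = np.linalg.norm(y_grad - y_exact)
```

In this linear setting relaxation reaches the solution in a single
step, whereas the gradient scheme only reduces the error slowly,
in particular in the region without data, where information
spreads between neighbouring grid points only.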

Fig. 25 summarizes the typical bifurcation or phase 
transition behaviour for the mixture model. 




Figure 28: RBF (Radial Basis Function) regularizer.
(Mixture model $L_M$, relaxation, $\eta = 1.0$, at $\beta = 1$,
$\lambda_D = 10$, and $\lambda_0 = 1$,
$\lambda_2 = \sigma_{RBF}^2/2$,
$\lambda_4 = \sigma_{RBF}^4/(2!\,2^2)$.)
The $\lambda_i$ are chosen as the first three coefficients of the
RBF regularizer (29) with $\sigma_{RBF} = 3$. For the zero
template $T_0$, corresponding to the usual linear RBF method,
one sees clearly the superposition of Gaussian-like
functions centered at the data points.



11 Conclusions 



The paper is motivated by the fact that predetermined 
dependencies between answers to different questions are 
necessary for generalization and, thus, responsible for 
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Figure 29: The "Gradient-$\delta$-catastrophe": Row 1, left:
Relaxation with full inversion $A^{-1} = O^{-1}$, $\eta = 0.2$ (the
relaxation method converges with $\eta = 1$ in nearly one
step). Row 1, right: Gradient, i.e. $A^{-1} = I$ (identity)
and $\eta = 0.005$. Row 2: A 'Gaussian Gradient'
with $A^{-1}_{x,x'} = (1/(\sigma\sqrt{2\pi}))\, e^{-(x-x')^2/(2\sigma^2)}$
for $\sigma = 2$ (left) and $\sigma = 1$ (right) and $\eta = 0.005$.
Shown are the first 10 iterations for a situation with high data
influence: $\beta = 1$, $\lambda_D = 100$, $\lambda_0 = 1$,
$\lambda_2 = 1$, $\lambda_4 = 1$. To obtain an iteration procedure,
$O$ is split into two parts $O = A + B$ and a new guess $y^{i+1}$ is
generated according to $y^{i+1} = y^i - \eta A^{-1}(O y^i - t)$.
The gradient algorithm takes $A = I$ equal to the identity matrix.
For an operator $O$ with only diagonal (e.g. data terms) and near
diagonal (e.g. differences for a discretized differential operator)
matrix elements the update information propagates, besides
through the $x$-independent factors $Z_i$ contained in
$O$ and $t$, only locally between neighbouring $x$. In the
limit of continuous $x$ this becomes arbitrarily slow for
differential operators, i.e. for a vanishing effective neighborhood.
Analogous problems arise if $A^{-1}$ has an (approximate)
block structure. Hence, in practice, when $O$ is
not an integral operator, the update algorithm has to use an
$A^{-1}$ which connects dependent parts of $y$. This can in
our case be achieved by enlarging the neighborhood, for
example by choosing (at least for the beginning) a Gaussian
$A^{-1}$ or other forms of local means. The nonlocality
may be further increased by using nonlocal or hierarchically
organized blocks ('neighborhoods'), as for example
in multigrid methods. One may also write $\Delta y = y - \tilde y$
with $\tilde y_x = \tilde y(x, w)$ an approximation of $y$ with
nonlocal dependencies, and update first $\tilde y(x, w)$ and then
$\Delta y$. In case $\tilde y$ approximates the nonlocal dependencies
quite well, a gradient algorithm (e.g. backpropagation for a neural
net) with respect to the parameters $w$ may already converge
nicely, so that adaptation of the remaining $\Delta y$ is fast enough.
Also, the $\bar T^{0,i}_{\lambda_2=\lambda_4=0}$ or, if available as
result of a first linear regularization step, the $\bar T^{0,i}$
might be good starting configurations. In general, the relevant
nonlocal dependencies can be different in the various low and high
temperature regimes.
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learning. Standard training examples alone can never 
lead to any prediction for new data. It therefore seems
necessary to concentrate more on information about the
dependencies than is usually done.

The aim of the paper is to treat those dependencies
as explicitly as possible, and to discuss possibilities
to base information about dependencies upon measurement
and control. This is especially important if the
objects/situations of interest are complex and/or the 
amount of available standard training data is small. 

For local questions, the predetermined dependencies
have the form of stationarity conditions for the answer
generating probability distributions. To enable generalization,
answers to nonlocal questions must be available,
with the definitions of nonlocal questions representing
predetermined dependencies. Commonly implemented
nonlocal dependencies correspond to bounds on smoothness
or other symmetries. Many forms of prior information
can be available in practice. Often they appear as
implicit or linguistic concepts defining, for example, object
classes like faces, chairs, pedestrians or cars, and are,
because they are difficult to formalize, not included in the
learning algorithm. Such dependencies can be implemented
by an interface using fuzzy priors. Nonlocal questions
are usually not sampled and not directly included in the
loss function.

Often non-approximation aspects are of interest, like
the amount of resources (time, memory, money,
understandability, complexity) needed by available
alternatives $f$. Then empirical risk minimization cannot be
interpreted as being equivalent to a Bayesian maximum
posterior approximation but can be extended to a two-step
procedure (MaP-MiR). Priors often depend formally
on an infinite number of function values. Such
priors can be implemented by using, for the preparation
process or for definition and control of the situation
under study, measuring devices different from those for the
training process.

Priors stating that a function is probably similar to a
template $T_1$ OR to a template $T_2$ can be implemented
by some 'soft OR' or mixture model. For the maximum
posterior approximation this leads to stationarity
equations with a nonlinear dependence on the local function
values, reflecting the nontrivial interactions between
different locations. As nonlinear equations have in general
multiple solutions, such priors can, for example, be used
to model phenomena like ambiguous illusions in perception.
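
A minimal sketch of such a 'soft OR' is given below. It is not
the paper's stationarity equations (43, 44), which are not
reproduced here, but a generic two-component mixture under stated
assumptions: each component has the quadratic energy
$E_i(y) = \frac{1}{2}[\lambda_D \sum_k (y_{x_k} - d_k)^2 + (y - T_i)^T K (y - T_i)]$
with a finite-difference smoothness operator $K$, and the posterior
is taken proportional to $\sum_i e^{-\beta E_i(y)}$. The template
shapes, the noise-free data and all names are hypothetical.

```python
import numpy as np


def prior_operator(n, lam0, lam2, lam4):
    """K = lam0*I + lam2*D1^T D1 + lam4*D2^T D2 (finite-difference smoothness prior)."""
    d1 = np.diff(np.eye(n), n=1, axis=0)
    d2 = np.diff(np.eye(n), n=2, axis=0)
    return lam0 * np.eye(n) + lam2 * d1.T @ d1 + lam4 * d2.T @ d2


def responsibilities(y, templates, K, data_idx, data_val, lam_d, beta):
    """Soft-OR weights p_i proportional to exp(-beta * E_i(y))."""
    data_term = lam_d * np.sum((y[data_idx] - data_val) ** 2)
    energies = np.array([0.5 * (data_term + (y - T) @ K @ (y - T)) for T in templates])
    weights = np.exp(-beta * (energies - energies.min()))   # shifted for stability
    return weights / weights.sum()


def relax_mixture(y0, templates, K, data_idx, data_val, lam_d, beta, n_iter=50):
    """Fixed-point iteration: solve the linear problem for the current weights."""
    O = K.copy()
    np.add.at(O, (data_idx, data_idx), lam_d)
    t_data = np.zeros(len(y0))
    np.add.at(t_data, data_idx, lam_d * data_val)
    y = y0.copy()
    for _ in range(n_iter):
        p = responsibilities(y, templates, K, data_idx, data_val, lam_d, beta)
        mixed_template = sum(w * T for w, T in zip(p, templates))
        y = np.linalg.solve(O, t_data + K @ mixed_template)
    return y, responsibilities(y, templates, K, data_idx, data_val, lam_d, beta)


n = 40
phase = 3 * np.pi * np.arange(n) / (n - 1)
T1 = np.sin(phase) - np.mean(np.sin(phase))             # assumed template shapes
T2 = np.sin(phase) ** 2 - np.mean(np.sin(phase) ** 2)
K = prior_operator(n, lam0=1.0, lam2=1.0, lam4=1.0)
data_idx = np.arange(30)
data_val = T1[data_idx]                                 # noise-free data from T1

# Low temperature: one template dominates the weights.
y_low, p_low = relax_mixture(T1, [T1, T2], K, data_idx, data_val, lam_d=100.0, beta=1.0)
# High temperature: both templates contribute almost equally.
y_high, p_high = relax_mixture(T2, [T1, T2], K, data_idx, data_val, lam_d=100.0, beta=1e-9)
```

The weights depend on the whole function $y$, so each update couples
all locations. This is the source of the nonlinearity: for fixed
weights the problem is linear, and in the high temperature limit it
reduces to a single linear problem with the averaged template.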

The results of learning are statements about decision 
relevant data assuming their dependency on available 
data. Hence, learning is a reformulation of knowledge, 
and consists of 

1. the algorithmic problem of extracting decision rel- 
evant information from given knowledge, usually a 
list of data and required dependencies, and 

2. the (empirical) validity problem of controlling or 
identifying the relating dependencies. 

Consequently, control (e.g. identification of situations
appropriate for generalization) is needed to relate the
results of past measurements to situations for which learning
is intended, and the ability to generalize is intimately
related to the ability to control and compute the
required dependencies in (a finite number of) application
situations. For example, stationarity of data generating
processes and attributes of measurement devices, or in
more biological terms of the sensory input, like limiting
bounds and averaging processes, have to be established
and controlled to guarantee smoothness and other
approximate symmetries. Summarizing the active interpretation
of learning we can say: Generalization is control
or identification of decision relevant dependencies.
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